Word Sense Induction with Neural biLM and Symmetric Patterns
Asaf Amrami† and Yoav Goldberg†‡
†Computer Science Department, Bar Ilan University, Israel
‡Allen Institute for Artificial Intelligence
{asaf.amrami, yoav.goldberg}@gmail.com

Abstract
An established method for Word Sense Induction (WSI) uses a language model to predict probable substitutes for target words, and induces senses by clustering these resulting substitute vectors. We replace the ngram-based language model (LM) with a recurrent one. Beyond being more accurate, the use of the recurrent LM allows us to effectively query it in a creative way, using what we call dynamic symmetric patterns. The combination of the RNN-LM and the dynamic symmetric patterns results in strong substitute vectors for WSI, allowing us to surpass the current state-of-the-art on the SemEval 2013 WSI shared task by a large margin.
We deal with the problem of word sense induction (WSI): given a target lemma and a collection of within-sentence usages of it, cluster the usages (instances) according to the different senses of the target lemma. For example, for the sentences:

(a) We spotted a large bass in the ocean.
(b) The bass player did not receive the acknowledgment she deserves.
(c) The black sea bass is a member of the wreckfish family.

We would like to cluster (a) and (c) in one group and (b) in another. Note that some mentions are ambiguous. For example, (d) matches both the music and the fish senses: [This example shows homonymy, a case where the same word form has two distinct meanings. A more subtle case is polysemy, where the senses share some semantic similarity. In "She played a low bass note", the sense of bass is related to the sense in (b), but distinct from it. The WSI task we tackle in this work deals with both cases.] (d)
Bass scales are the worst.

This calls for a soft clustering, allowing us to probabilistically associate a given mention with two senses.

The problem of WSI has been extensively studied, with a series of shared tasks on the topic (Agirre and Soroa, 2007; Manandhar et al., 2010; Jurgens and Klapaftis, 2013), the latest being SemEval 2013 Task 13 (Jurgens and Klapaftis, 2013). Recent state-of-the-art approaches to WSI rely on generative graphical models (Lau et al., 2013; Wang et al., 2015; Komninos and Manandhar, 2016). In these works, the sense is modeled as a latent variable that influences the context of the target word. The later models explicitly differentiate between local (syntactic, close to the disambiguated word) and global (thematic, semantic) context features.
Substitute Vectors
Baskaya et al. (2013) take a different approach to the problem, based on substitute vectors. They represent each instance as a distribution over possible substitute words, as determined by a language model (LM). The substitute vectors are then clustered to obtain senses. Baskaya et al. (2013) derive their probabilities from a 4-gram language model. Their system (AI-KU) was one of the best performing at the time of the SemEval 2013 shared task. Our method is inspired by the AI-KU use of substitution-based sense induction, but deviates from it by moving to a recurrent language model. Besides being more accurate, this allows us to further improve the quality of the derived substitutions by the incorporation of dynamic symmetric patterns.
BiLM
Bidirectional RNNs were shown to be effective for word-sense disambiguation and lexical substitution tasks (Melamud et al., 2016; Yuan et al., 2016; Raganato et al., 2017). We adopt the ELMo biLM model of Peters et al. (2018), which was shown to produce very competitive results for many NLP tasks, and use the pre-trained ELMo biLM provided by Peters et al. (2018). However, rather than using the LSTM state vectors as suggested in the ELMo paper, we opt instead to use the predicted word probabilities. Moving from continuous and opaque state vectors to discrete and transparent word distributions allows far better control of the resulting representations (e.g. by sampling, re-weighting and lemmatizing the words) as well as better debugging opportunities.

As expected, the move to the neural biLM already outperforms the AI-KU system, and matches the previous state-of-the-art. However, we observe that the substitute vectors do not take into account the disambiguated word itself. We find that this often results in noisy substitutions. As a motivating example, consider the sentence "the doctor recommends oranges for your health". Here, running is a perfectly good substitution, as the "fruitness" of the target word itself isn't represented in the context. We would like the substitute word distribution representing the target word to take both kinds of information into account: the context as well as the target word itself.
Dynamic Symmetric Patterns
Our main proposal incorporates such information. It is motivated by Hearst patterns (Hearst, 1992; Widdows and Dorow, 2002; Schwartz et al., 2015), and made possible by neural LMs. Neural LMs are better at capturing long-range dependencies, and can handle and predict unseen text by generalizing from similar contexts. Conjunctions, and in particular the word and, are known to combine expressions of the same kind. Recently, Schwartz et al. (2015) used conjunctive symmetric patterns to derive word embeddings that excel at capturing word similarity. Similarly, Kozareva et al. (2008) search for doubly-anchored patterns including the word and in a large web corpus to improve semantic-class induction. The method of Schwartz et al. (2015) results in context-independent embeddings, while that of Kozareva et al. (2008) takes some context into account but is restricted to exact corpus matches and thus suffers a lot from sparsity.

We make use of the rich sequence representation capabilities of the neural biLM to derive context-dependent symmetric pattern substitutions. [We thank the ELMo team for sharing the pre-trained models.] Relying on the generalization properties of neural language models and the abundance of the "X and Y" pattern, we present the language model with a dynamically created incomplete pattern, and ask it to predict probable completion candidates. Rather than predicting the word distribution following "the doctor recommends", we instead predict the distribution following "the doctor recommends oranges and". This provides substantial improvement, resulting in state-of-the-art performance on the SemEval 2013 shared task. The code for reproducing the experiments and our analyses is available at https://github.com/asafamr/SymPatternWSI .
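As a concrete illustration of forming such incomplete-pattern queries, here is a minimal sketch in Python; the function name and the toy tokenization are our own, not taken from the released code:

```python
def make_pattern_queries(tokens, i):
    """Build the two incomplete 'X and Y' queries for the target at position i.

    The forward LM is asked to complete '... <target> and ___'; the backward
    LM, reading right-to-left, completes '___ and <target> ...'.
    """
    forward_query = tokens[: i + 1] + ["and"]
    backward_query = ["and"] + tokens[i:]
    return forward_query, backward_query

# For the motivating sentence, with target word "oranges" at position 3:
fwd, bwd = make_pattern_queries(
    ["the", "doctor", "recommends", "oranges", "for", "your", "health"], 3)
```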
Given a target word (a lemma and part-of-speech pair), together with several sentences in which the target word is used (instances), our goal is to cluster the word usages such that each cluster corresponds to a different sense of the target word. Following the SemEval 2013 shared task and motivating example (d) from the introduction, we seek a soft (probabilistic) clustering, in which each word instance is assigned a probability of belonging to each of the sense-clusters.

Our algorithm works in three stages: (1) We first associate each instance with a probability distribution over in-context word-substitutes. This probability distribution is based on a neural biLM (section 2.1). (2) We associate each instance with k representatives, each containing multiple samples from its associated word distributions (section 2.3). (3) Finally, we cluster the representatives and use the hard clustering to derive a soft clustering over the instances (section 2.4).

We use the pre-trained neural biLM as a black box, but use linguistically motivated processing of both its input and its output: we rely on the generalization power of the biLM and query it using dynamic symmetric patterns (section 2.2); and we lemmatize the resulting word distributions.

Running example
In what follows, we demonstrate the algorithm using a running example of inducing senses for the word sound, focusing on the instance sentence:
I liked the sound of the harpsichord.
We follow the ELMo biLM approach (Peters et al., 2018) and consider two separately trained language models: a forward model trained for predicting p→(w_i | w_1, ..., w_{i-1}) and a backward model p←(w_i | w_n, ..., w_{i+1}). Rather than combining the two models' predictions into a single distribution, we simply associate the target word with two distributions, one from p→ and one from p←. For convenience, we use LM→(w_1 w_2 ... w_{i-1}) to denote the distribution p→(w_i | w_1, ..., w_{i-1}) and LM←(w_{i+1} w_{i+2} ... w_n) to denote p←(w_i | w_n, ..., w_{i+1}).

Context-based substitution
In the purely context-based setup (the one used in the AI-KU system) we represent the target word sound by the two distributions:

LM→(I liked the)
LM←(of the harpsichord)

The resulting top predictions from each distribution are {idea: 0.12, fact: 0.07, article: 0.05, guy: 0.04, concept: 0.02} and {sounds: 0.04, version: 0.03, rhythm: 0.03, strings: 0.03, piece: 0.02} respectively.

As discussed in the introduction, conditioning solely on context ignores valuable information. This is evident in the resulting word distributions. We use the coordinative symmetric pattern
"X and Y" in order to produce a substitute vector incorporating both the word and its context. Concretely, we represent a target word w_i by p→(w′ | w_1, ..., w_i, and) and p←(w′ | w_n, ..., w_i, and). For our running example, this translates to:

LM→(I liked the sound and)
LM←(and sound of the harpsichord .)

with resulting top words {feel: 0.15, felt: 0.11, thought: 0.07, smell: 0.06, sounds: 0.05} and {sight: 0.16, sounds: 0.11, rhythm: 0.04, tone: 0.03, noise: 0.03}. The distributions predicted using the and pattern exhibit a much nicer behavior, and incorporate global context (resulting in sense-related substitutes) as well as local and syntactic information resulting from the target word itself. Table 1 compares the context-only and symmetric-pattern substitutes for two senses of the word sound.

To perform fuzzy clustering, we follow AI-KU and associate each instance with k representatives, but deviate in the way the representatives are generated. Specifically, each representative is a set of size 2ℓ, containing ℓ samples from the forward distribution and ℓ samples from the backward distribution. In the symmetric pattern case above, a plausible representative, assuming ℓ = 2, would be {feel, sounds, sight, rhythm}, where two words were predicted by each side's LM. In this work, we use ℓ = 4 and k = 20.

After obtaining k representatives for each of the n word instances, we cluster the nk representatives into distinct senses and translate this hard clustering of representatives into a probabilistic clustering of the originating instances.

Hard-clustering of representatives
Let V be the vocabulary obtained from all the representatives. We associate each representative with a sparse |V|-dimensional bag-of-features vector, and arrange the representatives into an nk × |V| matrix M, where each row corresponds to a representative. We now cluster M's rows into senses. We found it beneficial to transform the matrix using TF-IDF. Treating each representative as a document, TF-IDF reduces the weight of uninformative words shared by many representatives. We use agglomerative clustering (cosine distance, average linkage) and induce a fixed number of clusters. We use sklearn (Pedregosa et al., 2011) for both TF-IDF weighting and clustering.
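The representative-generation step described above (ℓ samples from each directional distribution, k representatives per instance) can be sketched as follows; the sampling helper is our own illustration, not the released implementation:

```python
import random

def make_representatives(fwd_dist, bwd_dist, k=20, ell=4, seed=0):
    """Draw k representatives for one instance: each is a bag holding
    ell samples from the forward substitute distribution and ell samples
    from the backward one."""
    rng = random.Random(seed)

    def sample(dist, n):
        words = list(dist)
        weights = [dist[w] for w in words]
        return rng.choices(words, weights=weights, k=n)

    return [sample(fwd_dist, ell) + sample(bwd_dist, ell) for _ in range(k)]

# Top symmetric-pattern predictions for the running example:
fwd = {"feel": 0.15, "felt": 0.11, "thought": 0.07, "smell": 0.06, "sounds": 0.05}
bwd = {"sight": 0.16, "sounds": 0.11, "rhythm": 0.04, "tone": 0.03, "noise": 0.03}
reps = make_representatives(fwd, bwd)  # 20 bags of 8 sampled substitutes
```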
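The TF-IDF weighting and clustering stage can be sketched as below. We use sklearn for TF-IDF as in the paper, but scipy's hierarchy routines for average-linkage clustering under cosine distance; the toy representatives and the function name are our own:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_representatives(representatives, n_clusters=7):
    """TF-IDF-weight the bag-of-words representatives, then run
    agglomerative clustering (cosine distance, average linkage)."""
    docs = [" ".join(rep) for rep in representatives]
    tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(docs).toarray()
    tree = linkage(pdist(tfidf, metric="cosine"), method="average")
    # Cut the dendrogram into a fixed number of clusters; 0-based labels.
    return fcluster(tree, t=n_clusters, criterion="maxclust") - 1

# Toy representatives with two obvious senses:
labels = cluster_representatives(
    [["fish", "sea"], ["fish", "ocean"], ["music", "guitar"], ["music", "note"]],
    n_clusters=2)
```

Cutting the dendrogram with a `maxclust` criterion mirrors inducing a fixed number of clusters, as done in the paper.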
Inducing soft clustering over instances
After clustering the representatives, we induce a soft clustering over the instances by associating each instance j with sense i based on the proportion of representatives of j that are assigned to cluster i. [In this work we use 7 clusters, which roughly matches the number of senses for each target word in the corpus. Dynamically selecting the number of clusters is left for future work. The effect of changing the number of clusters is explored in the supplementary material.]

The WSI task is defined over lemmas, and some target words have morphological variability within a sense. This is especially common with verb tenses, e.g., "I booked a flight" and "I am booking a flight". As the conjunctive symmetric pattern favors morphologically-similar words, the resulting substitute vectors for these two sentences will differ, each agreeing with the tense of its source instance. To deal with this, we lemmatize the predictions made by the language model prior to adding them to the representatives. Such removal of morphological inflection is straightforward when using the word distributions, but much less trivial when using raw LM state vectors, further motivating our choice of working with the word distributions. The substantial importance of the lemmatization is explored in the ablation experiments in the next section, as well as in the supplementary material.

                 Context Only                        Symmetric Pattern
                 Forward dist.    Backward dist.     Forward dist.          Backward dist.
"This is a sound idea, I like it."
                 sad 0.02         bad 0.12           welcome 0.09           funny 0.10
                 great 0.02       good 0.09          practical 0.03         beautiful 0.05
                 huge 0.02        great 0.06         comprehensive 0.03     fun 0.04
                 very 0.02        wonderful 0.05     light 0.02             simple 0.04
                 lesson 0.02      nice 0.04          balanced 0.02          interesting 0.03
"I liked the sound of the harpsichord."
                 idea 0.12        sounds 0.04        feel 0.15              sight 0.16
                 fact 0.07        version 0.03       felt 0.11              sounds 0.11
                 article 0.05     rhythm 0.03        thought 0.07           rhythm 0.04
                 guy 0.04         strings 0.03       smell 0.06             tone 0.03
                 concept 0.02     piece 0.03         sounds 0.05            noise 0.03

Table 1: Predicted substitutes for two senses of sound, for the context-only and the symmetric-pattern approaches.
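The proportion-based soft clustering is a simple count-and-normalize step; a sketch (function name and toy labels are ours):

```python
from collections import Counter

def soft_cluster(rep_labels, n_clusters=7):
    """Convert the hard cluster labels of one instance's representatives
    into a probability distribution over senses, by proportion."""
    counts = Counter(rep_labels)
    total = len(rep_labels)
    return [counts.get(c, 0) / total for c in range(n_clusters)]

# 20 representative labels for a hypothetical instance:
probs = soft_cluster([0, 0, 1, 0, 2] * 4)
```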
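The lemmatization of predictions amounts to collapsing inflected substitutes onto their lemmas and summing their probability mass. A minimal sketch, with a hand-written toy lemma map standing in for a real lemmatizer:

```python
def lemmatize_distribution(dist, lemma_of):
    """Collapse inflected predictions onto their lemmas, summing the
    probability mass. `lemma_of` maps word -> lemma; in practice this
    would come from a real lemmatizer."""
    out = {}
    for word, p in dist.items():
        lemma = lemma_of.get(word, word)
        out[lemma] = out.get(lemma, 0.0) + p
    return out

toy_lemmas = {"booked": "book", "booking": "book", "flights": "flight"}
merged = lemmatize_distribution(
    {"booked": 0.4, "booking": 0.3, "flights": 0.2, "hotel": 0.1}, toy_lemmas)
```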
Distribution cutoff and bias
Low-ranked LM predictions tend to be noisy. We thus consider only the top 50 words predicted by each LM, re-normalizing their probabilities to sum to one. Additionally, we ignore the final bias vector during prediction (words are predicted via softmax(Wx) rather than softmax(Wx + b)). This removes unconditionally probable (frequent) words from the top LM predictions.

We evaluate our method on the SemEval 2013 Task 13 dataset (Jurgens and Klapaftis, 2013), containing 50 ambiguous words, each with roughly 100 in-sentence instances, where each instance is soft-labeled with one or more WordNet senses.
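The distribution cutoff and bias removal described in the previous section can be sketched as follows (a minimal illustration; function names are our own):

```python
import numpy as np

def topk_renormalize(probs, k=50):
    """Keep the k most probable substitutes, renormalizing to sum to one."""
    top = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:k])
    z = sum(top.values())
    return {w: p / z for w, p in top.items()}

def predict_no_bias(W, x):
    """Predict word probabilities as softmax(W x), dropping the output
    bias b that would otherwise favor unconditionally frequent words."""
    logits = W @ x
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

cut = topk_renormalize({"a": 0.5, "b": 0.3, "c": 0.2}, k=2)
```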
Experiment Protocol
Due to the stochastic nature of the algorithm, we repeat each experiment 30 times and report the mean scores together with the standard deviation.
Evaluation metrics
We follow previous work (Wang et al., 2015; Komninos and Manandhar, 2016) and evaluate on two measures: Fuzzy Normalized Mutual Information (FNMI) and Fuzzy B-Cubed (FBC), as well as their geometric mean (AVG).

Systems
We compare against three graphical-model based systems which, as far as we know, represent the current state of the art: MCC-S (Komninos and Manandhar, 2016), Sense-Topic (Wang et al., 2015) and unimelb (Lau et al., 2013). We also compare against the AI-KU system. Wang et al. also present a method for dataset enrichment that boosted their model's performance. We didn't use the suggested methods and compare ourselves to the vanilla setting, but report the enrichment numbers as well.
Results
Table 2 summarizes the results. Our system using symmetric patterns outperforms all other setups with an AVG score of 25.4, establishing a new state-of-the-art on the task.
Ablation and analysis
We perform ablations to explore the contribution of the different components: Symmetric Patterns (SP), Lemmatization (LEM) and TF-IDF re-weighting. Figure 1 shows the results for the entire dataset (ALL, top), as well as broken down by part of speech. All components are beneficial and are needed for obtaining the best performance in all cases. However, their relative importance differs across parts of speech. Adjectives gain the most from the use of the dynamic symmetric patterns, while nouns
Table 2: Evaluation results on the SemEval 2013 Task 13 dataset. SW: embedding-similarity based feature weighting. AAC: extending instance sentences from their traced source. AUC: adding similar sentences from the dataset's originating corpus. We report our mean scores over 30 runs ± standard deviation.

Figure 1: Ablation breakdown by part of speech; each part of speech was averaged across runs. Bars are means of means and error bars are standard deviations.

gain the least. For verbs, the lemmatization is crucial for obtaining good performance, especially when symmetric patterns are used: using symmetric patterns without lemmatization, the mean score drops to 17.0. Lemmatization without symmetric patterns achieves a higher mean score of 20.5, while using both yields 22.8. Finally, for nouns it is the TF-IDF scoring that plays the biggest role.

We describe a simple and effective WSI method based on a neural biLM and a novel dynamic application of the
"X and Y" symmetric pattern. The method substantially improves on the state-of-the-art. Our results provide further validation that RNN-based language models contain valuable semantic information.

The main novelty in our proposal is querying the neural LM in a creative way, using what we call dynamic symmetric patterns. We believe that the use of such dynamic symmetric patterns (or, more generally, dynamic Hearst patterns) will be beneficial to NLP tasks beyond WSI.

In contrast to previous work, we used discrete predicted word distributions rather than the continuous RNN states. This paid off by allowing us to inspect and debug the representation, as well as to control it in a meaningful way by injecting linguistic knowledge in the form of lemmatization, and by distributional cutoff and TF-IDF re-weighting. We encourage others to consider using explicit, discrete representations when appropriate.
Acknowledgments
The work was supported in part by the Israeli Science Foundation (grant number 1555/15) and the German Research Foundation via the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).
References
Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 7–12, Stroudsburg, PA, USA. Association for Computational Linguistics.

Osman Baskaya, Enis Sert, Volkan Cirik, and Deniz Yuret. 2013. AI-KU: Using substitute vectors and co-occurrence modeling for word sense induction and disambiguation. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 300–306.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, pages 539–545. Association for Computational Linguistics.

David Jurgens and Ioannis Klapaftis. 2013. SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 290–299.

Alexandros Komninos and Suresh Manandhar. 2016. Structured generative models of continuous features for word sense induction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3577–3587.

Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of ACL-08: HLT, pages 1048–1056, Columbus, Ohio. Association for Computational Linguistics.

Jey Han Lau, Paul Cook, and Timothy Baldwin. 2013. unimelb: Topic modelling-based word sense induction. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 307–311.

Suresh Manandhar, Ioannis P. Klapaftis, Dmitriy Dligach, and Sameer S. Pradhan. 2010. SemEval-2010 task 14: Word sense induction & disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 63–68. Association for Computational Linguistics.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1156–1167.

Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 258–267.

Jing Wang, Mohit Bansal, Kevin Gimpel, Brian D. Ziebart, and Clement T. Yu. 2015. A sense-topic model for word sense induction with unsupervised data enrichment. Transactions of the Association of Computational Linguistics, 3(1):59–71.

Dominic Widdows and Beate Dorow. 2002. A graph model for unsupervised lexical acquisition. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. arXiv preprint arXiv:1603.07012.
Supplementary Material
Statistics of the SemEval 2013 Task 13 Dataset
SemEval 2013 Task 13 consists of 50 targets, each having a lemma and a part of speech (20 verbs, 20 nouns and 10 adjectives). We use the dataset only for evaluation. Most targets have around 100 labeled instances (sentences containing a usage of the target in its designated part of speech, together with one or more WordNet senses assigned by a human labeler). Exceptions are the targets trace.n and book.v, which have 37 and 22 labeled instances respectively. Leaving out the two anomalous targets mentioned above, we are left with 4605 instances from 48 targets: 19 verb, 19 noun and 10 adjective targets. We note that the small size of the dataset should make one cautious about drawing quick conclusions; yet, our results seem to be consistent.
Effect of the Choice of Number of Clusters
An important statistic of the dataset is the number of senses per target. The average number of senses per target in the dataset is 6.94 (stdev: 2.71). Breaking down by part of speech, verbs average 5.90 senses per target.

The Importance of Lemmatization
The ablation results in the paper indicate that for verbs, using symmetric patterns without lemmatization yields poor results. We present the analysis that motivated our use of lemmatization. Consider the samples from the biLM, with and without symmetric patterns, for the instance:
It was when I was a high-school student that I became convinced of this fact for the first time.
Figure 2: AVG score by number of clusters.

fw LM, no SP: didn, write, 'd, learnt, start
bw LM, no SP: seem, be, grow, be, be
fw LM, with SP: went, got, started, wasn, loved
bw LM, with SP: 1990s, decade, 1980s, afterwards, changed

Another sentence, in another tense:
The issue will become more pressing as an estimated 40,000 to 50,000 Chinese, mostly unskilled, come to settle each year.

fw LM, no SP: be, be, remain, likely, be
bw LM, no SP: becoming, grown, becoming, much, becomes
fw LM, with SP: remains, remain, which, continue, how
bw LM, with SP: rising, overseas, booming, abroad, expanded

When using the symmetric patterns, the predicted verbs tend to share the tense of the target word. This results in targets of different tenses having nearly distinct distributions, even when the targets share the same sense, splitting the single sense cluster into two (or more) tense clusters. We quantify this intuition by computing the correlation between tense and induced clusters (senses), as given by the Normalized Mutual Information (NMI). We measure the NMI between the tense of each verb instance in its sentence and its most probable induced cluster in the different settings, as well as the NMI between the verb instances and the gold clusters. Table 3 summarizes the results. We see that in the gold clusters there is indeed very little correlation (0.15) between the tense and the sense. When using SP but not lemmatization (w/o LEM), the correlation is substantially higher (0.67). When using neither lemmatization nor SP (w/o LEM and SP), the correlation is 0.27, much closer to the gold one. Performing explicit lemmatization naturally reduces the correlation with tense, and using the full model (Final model) results in a correlation of 0.22, close to the gold number of 0.15.
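The tense-to-cluster correlation can be computed with sklearn's NMI implementation; the toy labels below are a hypothetical illustration, not data from the paper:

```python
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical toy data: the tense of each verb instance, and the
# instance's most probable induced cluster.
tenses = ["past", "past", "present", "present", "past", "present"]
clusters = [0, 0, 1, 1, 0, 1]  # here the clusters track tense perfectly
nmi = normalized_mutual_info_score(tenses, clusters)
```

A high NMI, as in this perfectly tense-aligned toy case, indicates the induced clusters are splitting on tense rather than sense.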
Some Failure Modes of Dynamic SymmetricPatterns
While the use of dynamic symmetric patterns improves performance and generally produces good substitutes for contextualized words, we also identify some failure modes and unexpected behavior.
Common phrases involving conjunctions
Some target words have a strong prior to appear in common phrases involving a conjunction, causing the strong local pattern to override context-based hints. For example, when the LM is asked to complete "... state and", its prior on church makes it a very probable completion, regardless of context and sense. This phenomenon motivated our use of TF-IDF for down-weighting too-common words. Relatedly, a common completion for symmetric patterns is the word then, as "and then" is a very common phrase. This completion even ignores the target word, and could be troublesome if a global, cross-lemma clustering is attempted.
Multi-word phrase substitutes
Sometimes the LM does interpret the and as a trigger for a symmetric relation, but on a chunk extending beyond the target word. For example, when presented with the query
"The human heart not only makes heart sounds and", the forward LM predicted among its top twenty suggestions the word muscle, followed by a next-word prediction of movements. That is, the symmetry extends beyond "sounds" to the phrase "heart sounds", which could be substituted by "muscle movements". We didn't specifically address this in the current work, but note that restricting the prediction to agree with the target word on part of speech and plurality may help in mitigating this. Furthermore, this suggests an exciting direction for moving from single words towards the handling of multi-word units.

Settings              NMI (mean)
Gold labels           0.15
Final model           0.22
w/o LEM               0.67
w/o LEM and SP        0.26

Table 3: NMI between verb instance tense and most probable induced cluster, in the different settings and for the gold clusters.