Semi-supervised Learning for Word Sense Disambiguation
Darío Garigliotti
FAMAF - Faculty of Mathematics, Astronomy, Physics and Computation
National University of Córdoba, Argentina
[email protected]
Abstract.
This work is a study of the impact of multiple aspects in a classic unsupervised word sense disambiguation algorithm. We identify relevant factors in a decision rule algorithm, including the initial labeling of examples, the formalization of the rule confidence, and the criteria for accepting a decision rule. Some of these factors are only implicitly considered in the original literature. We then propose a lightly supervised version of the algorithm, and employ a pseudo-word-based strategy to evaluate the impact of these factors. The obtained performances are comparable with those of highly optimized formulations of the word sense disambiguation method.
Keywords:
Natural language processing, Word sense disambiguation, Semi-supervised learning
⋆ This work was awarded the Third Place in the EST 2013 Contest (ISSN 1850-2946) at the 42nd JAIIO (Annals of 42nd JAIIO - Argentine Journals of Informatics - ISSN 1850-2776). The original article in Spanish, titled "Desambiguación de Palabras Polisémicas mediante Aprendizaje Semi-supervisado", can be found at http://42jaiio.sadio.org.ar/proceedings/simposios/Trabajos/EST/20.pdf

1 Introduction

Word sense disambiguation consists in determining the correct sense of a particular occurrence of a polysemous word. Disambiguating word occurrences is a key task for many basic Natural Language Processing (NLP) components, since they assume that there is no ambiguity in an input text. A strategy widely adopted by methods for this task is to emulate the disambiguation process carried out by humans when reading, that is, to use information about the context of a word occurrence.

This work is an evaluation of the impact of several aspects of a classic unsupervised word sense disambiguation algorithm. By closely inspecting this algorithm, we identify factors that are relevant for its performance, some of which are only implicitly considered in the original literature where the algorithm was first described. We propose a lightly supervised version of the decision rule algorithm for word sense disambiguation. We make use of an evaluation strategy based on pseudo-words to show that our approach achieves performance comparable to that of highly optimized versions of the algorithm.

The rest of this paper is organized as follows. Section 2 describes related literature on word sense disambiguation, in particular the unsupervised algorithm studied in this work. Our semi-supervised decision rule approach is then presented in Section 3. Next, Section 4 discusses the experimental results. Finally, in Section 5 we present the conclusions and mention a number of directions for future investigation.

2 Related Work

Word sense disambiguation has been addressed with supervised learning methods. These require a large dataset of examples, typically sentences with an occurrence of an ambiguous word, labeled by hand with the correct sense. Yarowsky [1] introduces an unsupervised algorithm that combines the structure of decision lists with the unsupervised labeling of a large amount of initial examples. Given an ambiguous word, referred to as the target, this labeling is performed by determining a seed collocation or evidence that is representative of each of the senses of the target. In this way, the method largely avoids supervision, yet it is not fully unsupervised, since it requires a few labeled examples. Moreover, it is possible to introduce bias in the choice of the initial collocations. This method, hereinafter referred to as the Yarowsky algorithm, uses two main heuristics: (i) "one sense per collocation", meaning that words co-occurring with the target give strong clues about its correct sense, and (ii) "one sense per discourse", by which it can be assumed that the target has the same sense in multiple occurrences within the same discourse or text. The work by Abney [2] is the first systematic study of this algorithm.
3 Semi-supervised Decision Rule Approach

The main part of this section details the unsupervised algorithm introduced by Yarowsky, along with a discussion of the identified factors and the approach we derive from them. The section ends with a description of the processing steps used to obtain the dataset from a textual corpus.
Given a set of instances, each with an occurrence of a fixed target word, the Yarowsky algorithm consists in performing the following steps:

• Determine representative collocations, and split the instance set into subsets according to the sense labels.
• While the set of non-labeled instances does not converge:
  - Learn decision rules of the shape collocation ⇒ sense by inspecting the labeled examples.
  - Add to the decision list the rules that meet the acceptance criteria.
  - Sort the decision list by rule confidence.
  - Apply the decision list to the entire instance set, labeling each example with the first rule in the list whose condition the example meets.
  - Apply the "one sense per discourse" criterion to possibly filter out examples labeled in the step above.
• After converging to a residual set of non-labeled examples, the final decision rule list can be applied to examples not yet seen.

In the algorithm introduced by Yarowsky [1], many criteria are only lightly justified and, when they are, appear without their respective (optimal) parameters. Other important factors of the algorithm are not even mentioned explicitly. Specifically, in this work we identify the following factors, and propose the respective experimental settings for our version of the algorithm (a minimal code sketch of the resulting learning loop follows the list):

• Factor 1: Instead of relying on seed collocations as mentioned above, we perform the initial labeling by manually disambiguating two examples per sense, for each target word. In this way, the learning algorithm becomes semi-supervised. As we show in Section 4, this setting is very important.
• Factor 2: Regarding types of collocations or evidences, Yarowsky considers co-occurrence adjacent to any other word in the context (typically a sentence surrounding the target), co-occurrence adjacent to content words (i.e., nouns, adjectives, verbs or adverbs) versus function words, and co-occurrence in a window of +/- n to m words, for given n, m. The original analysis of the algorithm does not mention the impact of which of those collocation types to use, nor with which parameters, e.g., for n, m in the window-based case. In this work, we only use simple co-occurrence collocations, i.e., a word co-occurring with the target in any position in the context.
• Factor 3: We discard the "one sense per discourse" heuristic.
• Factor 4: The equation of rule confidence, or rule probability, that we consider differs from the classic log-likelihood-based one used in the original algorithm [1]. Given f(E_i), the number of examples with the collocation or evidence E_i, and f(S_j, E_i), the number of examples with evidence E_i labeled with the sense S_j, an optimization [3] allows us to use the following equation for the confidence that a given evidence determines a given sense:

\text{confidence} = \frac{f(S_j, E_i)}{f(S_j, E_i) + f(\neg S_j, E_i)} = \frac{f(S_j, E_i)}{f|_{\{\text{labeled examples}\}}(E_i)}. \qquad (1)

• Factor 5: The Yarowsky algorithm does not mention the confidence threshold as part of the acceptance criteria for a decision rule to enter the decision list. Abney [2] proposes a threshold equal to 1/L, with L being the number of senses known for the target. We work with a very strict confidence threshold, initially equal to 0.95.
• Factor 6: While Yarowsky [1] allows removing the label of an example when the rule that assigned it falls below the threshold, Abney [2] forces a label to always be preserved once an example has been disambiguated, although the label can be changed. In our work, a labeled example keeps its first assigned sense, i.e., the label can be neither removed nor changed.
• Factor 7: In the case of equal confidence, we sort two rules according to a secondary criterion, coverage, that is, the number of examples whose collocation meets the rule condition. Note that rule coverage is equal to the denominator in Eq. (1). This factor is not mentioned for the Yarowsky algorithm [1], yet it is considered by Tsuruoka and Chikayama [3].
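To make the interplay of these factors concrete, below is a minimal Python sketch of the learning loop under our settings: seed labels from manual disambiguation (Factor 1), simple co-occurrence evidences (Factor 2), the confidence of Eq. (1) (Factor 4) with the 0.95 threshold (Factor 5), permanent labels (Factor 6), and coverage as tie-breaker (Factor 7). All identifiers are illustrative and not taken from the original implementation.

from collections import Counter

def learn_decision_list(contexts, seed_labels, threshold=0.95):
    # contexts: dict mapping example id -> set of co-occurring lemmas
    #           (Factor 2: simple co-occurrence evidences only).
    # seed_labels: dict mapping a few example ids -> manually assigned
    #           senses (Factor 1: two examples per sense).
    labels = dict(seed_labels)  # Factor 6: labels are never removed/changed
    while True:
        # Count, over the labeled examples, how often each evidence
        # occurs and how often it co-occurs with each sense.
        ev_sense = Counter()
        ev_total = Counter()
        for ex, sense in labels.items():
            for ev in contexts[ex]:
                ev_sense[(ev, sense)] += 1
                ev_total[ev] += 1
        # Build candidate rules "evidence => sense"; confidence follows
        # Eq. (1), and coverage is its denominator (Factor 7).
        rules = []
        for (ev, sense), f_js in ev_sense.items():
            confidence = f_js / ev_total[ev]
            if confidence >= threshold and ev_total[ev] >= 1:  # Factors 5, 7
                rules.append((confidence, ev_total[ev], ev, sense))
        # Sort by confidence, breaking ties by coverage (Factor 7).
        rules.sort(key=lambda r: (r[0], r[1]), reverse=True)
        # Label each unlabeled example with the first matching rule.
        newly_labeled = 0
        for ex, evidences in contexts.items():
            if ex in labels:
                continue
            for _, _, ev, sense in rules:
                if ev in evidences:
                    labels[ex] = sense
                    newly_labeled += 1
                    break
        if newly_labeled == 0:  # converged to a residual unlabeled set
            return rules, labels

Since labels are never removed (Factor 6) and the instance set is finite, the loop necessarily terminates once an iteration labels no new example.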
Given a textual corpus, we process it to construct the dataset as follows (a code sketch of this pipeline follows the list):

• We retain a line of raw text in the corpus only if it consists of at least ten words.
• We manually disambiguate two sentences per sense for each given target word, and label them accordingly in the corpus. We sometimes refer to this stage as the initial training.
• We perform lemmatization and part-of-speech (POS) tagging. Since we worked with a corpus in Spanish, in this step we apply the FreeLing toolkit (http://nlp.lsi.upc.edu/freeling/index.php/node/1).
• We preserve only content words, i.e., nouns, main verbs, and qualificative adjectives. These correspond, respectively, to POS tags of the shape N****, VM****, and AQ**** in the parole-EAGLES standard.
• We split this lemmatized, POS-tagged corpus into sentences, i.e., contexts.
• We select those contexts where the target occurs.
• We identify the contexts manually disambiguated in the first step.
• We take the set of all distinct lemmas occurring in any of these contexts.
• We build our lexicon by restricting the set of lemmas defined above to contain a lemma only if it occurs in at least ten contexts.
• We sort the lexicon according to the number of contexts each lemma occurs in.
• We vectorize the corpus according to the lexicon truncated by frequency of its lemmas, obtaining a training dataset where each instance holds a counter of co-occurrences of each lemma with the target in that context.

Two assumptions regarding the steps described above are worth mentioning. First, by performing lemmatization, it is intuitively assumed that morphological accidents do not alter the sense of the word in that context. Second, the lexicon is truncated, as is usual practice, due to the sparseness phenomenon: the vocabulary is very large, and the words occurring in one context are generally different from the ones used in another context.
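The following Python sketch summarizes these steps, assuming lemmatization and POS tagging have already been performed (e.g., by FreeLing) and that each sentence is given as a list of (lemma, POS tag) pairs; function and variable names are illustrative.

from collections import Counter

# Tag prefixes for the content words kept above (parole-EAGLES):
# nouns (N****), main verbs (VM****), qualificative adjectives (AQ****).
CONTENT_PREFIXES = ("N", "VM", "AQ")

def build_target_dataset(tagged_sentences, target, min_contexts=10):
    # Keep only content-word lemmas; keep only contexts with the target.
    contexts = []
    for sentence in tagged_sentences:
        lemmas = [lemma for lemma, tag in sentence
                  if tag.startswith(CONTENT_PREFIXES)]
        if target in lemmas:
            contexts.append(lemmas)
    # Lexicon: lemmas occurring in at least `min_contexts` contexts,
    # sorted by the number of contexts they occur in.
    context_freq = Counter(l for ctx in contexts for l in set(ctx))
    lexicon = sorted(
        (l for l, c in context_freq.items()
         if c >= min_contexts and l != target),
        key=context_freq.get, reverse=True)
    # Vectorize: each instance counts co-occurrences of every lexicon
    # lemma with the target in its context.
    index = {lemma: i for i, lemma in enumerate(lexicon)}
    vectors = []
    for ctx in contexts:
        row = [0] * len(lexicon)
        for lemma in ctx:
            if lemma in index:
                row[index[lemma]] += 1
        vectors.append(row)
    return lexicon, vectors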
4 Experimental Evaluation

In this section, we describe our evaluation strategy and the experimental settings, and analyze the experimental results.

The evaluation strategy based on pseudo-words was introduced by Schütze [4] as a simple and very economic method to evaluate word sense disambiguation algorithms. The method consists in randomly choosing two words, for example "banana" and "door", and replacing in a text corpus every occurrence of either of the two by the new target pseudo-word "bananadoor". A word sense disambiguation algorithm is then applied on the selected contexts. After this, each example is considered correctly disambiguated if the assigned sense ("banana" or "door") coincides with the original word that was replaced in that context by the pseudo-word (a minimal code sketch of this procedure is given after the experimental setup below). Even though the sense ambiguity introduced by this method can be seen as artifactual, its advantage is to produce large amounts of labeled examples for evaluation at almost no cost. This evaluation method also shows the independence of the disambiguation algorithm from the language of the corpus it is applied on, since it only uses the impact of the given collocations and does not assume any convention or bias in the language. Given the two words that are often chosen to explain the pseudo-word replacement, this evaluation method is in general also known as the bananadoor evaluation.

We employ two simple word sense disambiguation algorithms with which to compare the performance of our proposed approach:

• Baseline: given the word most often replaced by the pseudo-word, e.g., "door", corresponding to k% of all the replacements, the baseline labels every context with the most frequent sense (in this example, "door") and obtains an accuracy of k%.
• Random: using the same information that s is the most frequent sense, with k% of the replacements, this algorithm labels each context probabilistically, assigning the sense s k% of the times.

We work with a corpus of 57 million words in Spanish, consisting of digital articles of the Spanish newspapers La Vanguardia and El Periódico de Catalunya. For simplicity, our problem is constrained to disambiguating the sense of words with the following properties:

• Every target has only two senses, which are generally very different.
• Every target is a noun, and all its senses are nouns.
• Targets considered polysemous may possibly include words that are actually homonyms.
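As an illustration, the following Python sketch builds pseudo-word examples and scores a disambiguation output following the procedure just described; the function names are ours, chosen for this sketch.

def make_pseudoword_examples(contexts, w1="vida", w2="ciudad",
                             pseudo="vidaciudad"):
    # contexts: list of lemma lists. Every occurrence of w1 or w2 is
    # replaced by the pseudo-word; the replaced word is the gold sense.
    examples = []
    for ctx in contexts:
        for gold in (w1, w2):
            if gold in ctx:
                replaced = [pseudo if lemma == gold else lemma
                            for lemma in ctx]
                examples.append((replaced, gold))
    return examples

def pseudoword_accuracy(predicted_senses, examples):
    # An example counts as correct if the assigned sense coincides with
    # the word originally replaced by the pseudo-word.
    correct = sum(1 for pred, (_, gold) in zip(predicted_senses, examples)
                  if pred == gold)
    return correct / len(examples)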
Table 1. The five binary targets and statistics of their datasets.

Target       Sense A    Sense B      Dataset size   Lexicon size
Manzana      Fruta      Superficie        712            144
Naturaleza   Índole     Entorno         2,607            611
Movimiento   Cambio     Corriente       6,509          1,883
Tierra       Materia    Planeta         7,874          2,019
Interés      Finanzas   Curiosidad     14,640          3,104
In the first part of our experiments, we obtain a set of contexts for each target in a set of five selected targets. We refer to the context set of each target as the target dataset. We observe the impact of the identified factors on disambiguation performance, analyzing in particular, for example, the number of iterations required for convergence. Table 1 presents statistics of the datasets for the five selected targets. The two senses of each of the binary targets observed in this part are referred to as senses A and B.

In a second part, we obtain a new dataset by applying the bananadoor evaluation. The replacements are made using two relatively frequent words in Spanish, "vida" (Spanish for "life") and "ciudad" ("city"), leading to the pseudo-word "vidaciudad". This dataset consists of 62,819 examples and a lexicon of 3,937 lemmas. Here, the size of the lexicon after truncation is considerably smaller than the sizes for the per-target datasets in the first part, which may be of high importance when applying clustering to the dataset with limited computational resources.
In the first part, we apply our approach to each target dataset. The results in Table 2 show a quick convergence to a residual, stable set of non-labeled examples. Clearly, any factor which increases the number of rules in the decision list will have a positive impact regarding this convergence. One example would be Factor 1: performing the initial labeling as in the original algorithm leads to more rules in the first iteration. Another would involve Factor 3: by also implementing the "one sense per discourse" criterion, each iteration possibly provides more labeled examples to the next one. In both cases, for a larger dataset, this may lead to fewer iterations, although each one would take longer, since it would have to inspect more rules. Hence these factors, as set in our experimental configuration, have a negative impact on the final residual set. Factor 2 also has a positive impact regarding convergence, since a larger number of collocation types can capture more reliably some linguistic phenomena that escape the simple co-occurrence setting.

Our experimental setting for Factor 6 does not allow re-labeling. This may suggest a positive impact, yet re-labeling is taken under consideration in the previous work; it should thus be closely inspected with regard to a sort of "speed versus accuracy" dilemma.
Table 2. Convergence, and proportions of final residual sets of examples.

Target                     manzana  naturaleza  movimiento  tierra  interés  vidaciudad
No. of iterations
  needed to converge          5         6           7          8       7         6
% of the final residual
  set w.r.t. the dataset    6.46%    33.10%      42.83%     18.97%  33.25%    77.62%
[Fig. 1 here: two line plots over iterations. (a) Number of rules: rules rejected due to confidence, rules rejected due to coverage, accepted rules. (b) Number of examples: unlabeled examples, examples labeled with sense B, examples labeled with sense A.]

Fig. 1. Performance of our approach per iteration, for the target word "interés".

In Factor 5, a rule confidence threshold that is too lenient can impact convergence positively yet accuracy negatively. An element already mentioned, though not explicitly identified among our factors, is the lexicon truncation bound, presented in the last column of Table 1. This bound affects the performance in the second part, when trying to disambiguate the pseudo-target "vidaciudad". Specifically, it leaves a large final residual set, since we require that each lemma of its lexicon occurs in at least 30 contexts. Lowering the bound improves this situation, but reducing it excessively lets in noise from infrequent lemmas.

As pointed out in the related work [3], Eq. (1) can lead to undesired situations, given the few evidences available at the beginning of the learning problem. Figure 1(a) shows the performance per iteration with regard to the proportions of decision rules. We can see that the first iteration of the algorithm accepts very few rules, just within the required coverage, while most of the rules are rejected. The coverage criterion used in our experimental settings (Factor 7) requires as little as one evidence; any stricter setting turns out to be harmful due to the sparseness phenomenon, as it rejects every rule in the first two iterations and converges to a final set without any newly labeled examples.

In summary, performance is very sensitive to Factor 7, since any stricter setting would lead to no disambiguation altogether. This factor is in close relation with our setting of very light supervision for the initial training set.
Table 3. Performance of our approach (Decision List), and clustering performance (average across the five targets).

                   Bananadoor Evaluation              Clustering (average)
Algorithm    Baseline   Random   Decision List     KMeans   EM (2 clusters)
Accuracy      51.10%    50.13%        …               …           …

Factor 1, i.e., semi-supervised learning, is crucial in our approach. As we can observe in Fig. 1(b), the initial decision rule list drastically changes, in the first iteration, the proportion of labeled examples in the dataset; after that, the example subsets remain stable until convergence, as the rules get refined due to a larger coverage.

Factor 4, the confidence equation, has a particular impact in relation with the coverage Factor 7. A formalization given by Tsuruoka and Chikayama [3] (see Eq. (5) in Appendix A) calculates the optimal smoothing for Eq. (1); this optimization is intended to overcome the problem of low initial coverage. When using the optimized formulation, most of the collocations in the first iteration have zero coverage and, by smoothing, they receive a portion of the large confidence held by the very few rules with non-zero coverage. Yet, since the rules with null coverage are so many, each of these rules gets a negligible probability, i.e., they are rejected by our threshold, as they would be by any other reasonable confidence threshold.

Moving to the second part of our experiments, we conduct the bananadoor evaluation. Table 3 presents the experimental results in terms of accuracy. Our decision list approach outperforms the baseline and random algorithms. Furthermore, we can say that it reaches a reasonable performance when compared with the accuracy of 69.4% achieved by a variant from the literature [3] that employs a log-likelihood-based confidence formulation and an optimized coverage of at least three evidences. As shown in the last columns of Table 3, the accuracy is still higher than the average clustering accuracy over the five targets from the first part of our experiments. Nevertheless, it is worth mentioning that, due to limitations in the computational resources, clustering was applied on a restricted version of the datasets, made of only the ten most frequent lemmas of the respective lexicons. We would expect a significantly lower performance in the case of clustering the entire dataset, due to the noise introduced by using the full lexicon.

5 Conclusions and Future Work

We have conducted an evaluation of factors relevant to the performance of a lightly supervised word sense disambiguation algorithm. Our experimental results indicate that the initial training is a crucial element regarding both convergence and accuracy. It is indeed so relevant that it affects the parameter space for other factors of large impact, such as the confidence threshold and coverage. We also observe that an optimization of the confidence equation with smoothing can be harmful to the performance, due to the same consequences of the initial labeling. Other factors, like re-labeling, the "one sense per discourse" criterion, or involving more collocation types, may improve the performance at the cost of a slower convergence.

There are several additional lines of study that we are interested in exploring in future work. Firstly, we could take a closer look at the impact of some factors identified in this work, for example, by considering a variant of our approach that also includes rules based on the adjacency of a lemma belonging to a particular morphosyntactic category.
An instance of this would be the qualificative adjective "human" as a fixed lemma adjacent to "nature" in a phrase like "human nature", which is a very distinctive lemma for the "condition" sense of "nature". Secondly, we could introduce a criterion that makes the coverage factor dynamic, becoming stricter as the iterations progress. In this way, the population of rules could be controlled by restricting them to only the more reliable ones. Another line of future investigation is to observe the results using an initial training coupled with a misleading bias, that is, picking contexts containing collocations close to the target that suggest a sense different from the manually labeled one. This bias, and the converse one, i.e., picking sentences very representative of a sense of the target, is likely of high impact; yet, with our strategy of initial training, it is a challenge to avoid every possible bias. Lastly, another line of future work is to perform, prior to applying the word sense disambiguation algorithm, a stage of sense discovery. This could be done, for example, by clustering the original, unlabeled dataset, leading to a more natural partition of the label set. This line of investigation is of particular interest since it involves one of the key problems in word sense disambiguation, namely defining the domain of senses.
References
1. David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL).
2. Steven Abney. 2004. Understanding the Yarowsky algorithm. Computational Linguistics, 30(3).
3. Yoshimasa Tsuruoka and Takashi Chikayama. 2001. Estimating reliability of contextual evidences in decision-list classifiers under Bayesian learning. In Proc. of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS).
4. Hinrich Schütze. 1992. Context space. In AAAI Fall Symposium on Probabilistic Approaches to Natural Language.
A Optimization of the Rule Confidence (Tsuruoka and Chikayama [3])
The equation used in the original description of the Yarowsky algorithm [1] to calculate the confidence of a decision rule is:

\text{confidence} = \log \left( \frac{P(S_j \mid E_i)}{P(\neg S_j \mid E_i)} \right), \qquad (2)

where E_i is the contextual evidence or collocation, and S_j is the candidate sense for labeling an example. Tsuruoka and Chikayama [3] obtain another formulation, which produces decision lists equivalent to the ones from Eq. (2):

\text{confidence} = P(S_j \mid E_i). \qquad (3)

When a large number of examples with evidence E_i is available, Eq. (3) can be estimated by Maximum Likelihood as follows:

P(S_j \mid E_i) = \frac{f(S_j, E_i)}{f(E_i)}, \qquad (4)

where f(E_i) is the number of examples with evidence E_i, and f(S_j, E_i) is the number of examples with evidence E_i labeled with sense S_j.

The equation that we use in this work is Eq. (1):

\text{confidence} = \frac{f(S_j, E_i)}{f(S_j, E_i) + f(\neg S_j, E_i)} = \frac{f(S_j, E_i)}{f|_{\{\text{labeled examples}\}}(E_i)}.

This formulation overestimates Eq. (4) by restricting the denominator to only the set of labeled examples. Equation (1) is slightly corrected in the actual implementation of the algorithm to avoid division by zero.

Using Bayesian learning, Tsuruoka and Chikayama [3] obtain a formulation of the beta distribution that involves Θ = P(S_j | E_i), and thus estimate the optimal smoothing:

\text{confidence} = E[\Theta] = \frac{f(S_j, E_i) + 1}{f(E_i) + 2}. \qquad (5)
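To illustrate why the smoothed estimate penalizes low-coverage rules, the following short Python sketch (with illustrative names) compares Eq. (1) against Eq. (5) in the extreme case of a single labeled example carrying the evidence:

def confidence_restricted(f_js, f_labeled):
    # Eq. (1): denominator restricted to the labeled examples.
    return f_js / f_labeled

def confidence_smoothed(f_js, f_e):
    # Eq. (5): expectation of the Beta posterior over P(S_j | E_i).
    return (f_js + 1) / (f_e + 2)

# With scarce initial evidence the two estimates diverge sharply: a
# single labeled example carrying evidence E_i yields full confidence
# under Eq. (1), but only 2/3 under the smoothed estimate, so the rule
# is rejected by a 0.95 threshold.
print(confidence_restricted(1, 1))  # 1.0
print(confidence_smoothed(1, 1))    # 0.666...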