Q-WordNet PPV: Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages
Iñaki San Vicente, Rodrigo Agerri, German Rigau
IXA NLP Group, University of the Basque Country (UPV/EHU), Donostia-San Sebastián
{inaki.sanvicente, rodrigo.agerri, german.rigau}@ehu.eus

Abstract
This paper presents a simple, robust and (almost) unsupervised dictionary-based method, qwn-ppv (Q-WordNet as Personalized PageRanking Vector), to automatically generate polarity lexicons. We show that qwn-ppv outperforms other automatically generated lexicons in the four extrinsic evaluations presented here. It also shows very competitive and robust results with respect to manually annotated ones. The results suggest that no single lexicon is best for every task and dataset, and that the intrinsic evaluation of polarity lexicons is not a good indicator of performance on a Sentiment Analysis task. The qwn-ppv method makes it easy to create quality polarity lexicons whenever no domain-specific annotated corpora are available for a given language.
Opinion Mining and Sentiment Analysis are important for determining opinions about commercial products, for company reputation management and brand monitoring, and for tracking attitudes by mining social media. Given the explosion of information produced and shared via the Internet, it is not possible to keep up with the constant flow of new information by manual methods.

Sentiment Analysis often relies on the availability of words and phrases annotated according to the positive or negative connotations they convey. 'Beautiful', 'wonderful' and 'amazing' are examples of positive words, whereas 'bad', 'awful' and 'poor' are examples of negative ones.

The creation of lists of sentiment words has generally been performed by means of manual, dictionary-based and corpus-based methods. Manually collecting such lists of polarity-annotated words is labor-intensive and time-consuming, and manual work is thus usually combined with automated approaches, serving as a final check to correct mistakes. However, there are well-known lexicons which have been fully (Stone et al., 1966; Taboada et al., 2010) or at least partially manually created (Hu and Liu, 2004; Riloff and Wiebe, 2003).
Dictionary-based methods rely on a dictionary or lexical knowledge base (LKB) such as WordNet (Fellbaum and Miller, 1998) that contains synonyms and antonyms for each word. A simple technique in this approach is to start with some sentiment words as seeds, which are then used to perform some iterative propagation over the LKB (Hu and Liu, 2004; Kim and Hovy, 2004; Takamura et al., 2005; Turney and Littman, 2003; Mohammad et al., 2009; Agerri and García-Serrano, 2010; Baccianella et al., 2010).
Corpus-based methods have usually been applied to obtain domain-specific polarity lexicons: such lexicons have been created either by starting from a seed list of known words and finding other related words in a corpus, or by attempting to directly adapt a given lexicon to a new one using a domain-specific corpus (Hatzivassiloglou and McKeown, 1997; Turney and Littman, 2003; Ding et al., 2008; Choi and Cardie, 2009; Mihalcea et al., 2007). One particular issue arising from corpus methods is that, within a given domain, the same word can be positive in one context but negative in another. This problem is also shared by manual and dictionary-based methods, which is why qwn-ppv also produces synset-based lexicons for approaches to Sentiment Analysis at sense level.

This paper presents a simple, robust and (almost) unsupervised dictionary-based method,
qwn-ppv (Q-WordNet as Personalized PageRanking Vector), to automatically generate polarity lexicons by propagating automatically created seeds with a Personalized PageRank algorithm (Agirre et al., 2014; Agirre and Soroa, 2009) over a LKB projected onto a graph. We see qwn-ppv as an effective methodology to easily create polarity lexicons for any language for which a WordNet is available. This paper empirically shows that: (i) qwn-ppv outperforms other automatically generated lexicons (e.g. SentiWordNet 3.0, MSOL) on the 4 extrinsic evaluations presented here, and it also displays competitive and robust results with respect to manually annotated lexicons; (ii) no single polarity lexicon is fit for every Sentiment Analysis task; depending on the text data and the task itself, one lexicon will perform better than others; (iii) if required, qwn-ppv efficiently generates many lexicons on demand, depending on the task on which they will be used; (iv) intrinsic evaluation is not appropriate to judge whether a polarity lexicon is fit for a given Sentiment Analysis (SA) task, because good correlation with respect to a gold standard does not correspond to good performance on a SA task; (v) the method is easily applicable to create qwn-ppv(s) for other languages, as we demonstrate here by creating many polarity lexicons not only for English but also for Spanish; (vi) the method works at both word and sense levels and only requires the availability of a LKB or dictionary; finally, (vii) a dictionary-based method like qwn-ppv makes it easy to create quality polarity lexicons whenever no domain-specific annotated reviews are available for a given language.
After all, a dictionary is usually available for a given language; for example, the Open Multilingual WordNet site lists WordNets for up to 57 languages (Bond and Foster, 2013). Although there has been previous work using graph methods for obtaining lexicons via propagation, the qwn-ppv combination of seed generation and Personalized PageRank propagation is novel. Furthermore, it is considerably simpler and obtains better and easier-to-reproduce results than previous automatic approaches (Esuli and Sebastiani, 2007; Mohammad et al., 2009; Rao and Ravichandran, 2009). The next section reviews previous related work, paying special attention to the resources that are currently available for evaluation purposes. Section 3 describes the qwn-ppv method to automatically generate lexicons. The resulting lexical resources are evaluated in section 4. We finish with some concluding remarks and future work in section 5.
There is a large amount of work on Sentiment Analysis and Opinion Mining, and good comprehensive overviews are already available (Pang and Lee, 2008; Liu, 2012), so we will review only the most representative approaches and those closest to the present work. This means that we will not be reviewing corpus-based approaches but rather those constructed manually or upon a dictionary or LKB. We will in turn use the approaches reviewed here for comparison with qwn-ppv in section 4.

The most popular manually built polarity lexicon is part of the General Inquirer (Stone et al., 1966) and consists of 1915 words labelled as "positive" and 2291 as "negative". Taboada et al. (2010) manually created their lexicons by annotating the polarity of 6232 words on a scale from 5 to -5. Liu and colleagues, starting with Hu and Liu (2004), have over the years collected a manually corrected polarity lexicon formed by 4818 negative and 2041 positive words. Another manually corrected lexicon (Riloff and Wiebe, 2003) is the one used by the Opinion Finder system (Wilson et al., 2005); it contains 4903 negatively and 2718 positively annotated words.

Among the automatically built lexicons, Turney and Littman (2003) proposed a minimally supervised algorithm to calculate the polarity of a word depending on whether it co-occurred more with a previously collected small set of positive words than with a set of negative ones. Agerri and García-Serrano (2010) presented a very simple method to extract polarity information starting from the quality synset in WordNet. Mohammad et al. (2009) developed a method in which they first identify (by means of affix rules) a set of positive/negative words which act as seeds, and then use a Roget-like thesaurus to mark the synonymous words for each polarity type and to generalize from the seeds. They produce several lexicons, the best of which, MSOL(ASL and GI), contains 51K and 76K entries respectively and uses the full General Inquirer as seeds.
They performed both intrinsic and extrinsic evaluations using the MPQA 1.1 corpus.

Finally, there are two approaches that are somewhat closer to ours, because they are based on WordNet and graph-based methods. SentiWordNet 3.0 (Baccianella et al., 2010) is built in 4 steps: (i) they select the synsets of 14 paradigmatic positive and negative words used as seeds (Turney and Littman, 2003); these seeds are then iteratively extended following the construction of WordNet-Affect (Strapparava and Valitutti, 2004); (ii) they train 7 supervised classifiers on the synsets' glosses, which are used to assign polarity and objectivity scores to WordNet senses; (iii) in SentiWordNet 3.0 (Esuli and Sebastiani, 2007) they take the output of the supervised classifiers as input for applying PageRank to WordNet 3.0's graph; (iv) they intrinsically evaluate it with respect to MicroWnOp-3.0 using the p-normalized Kendall τ distance (Baccianella et al., 2010). Rao and Ravichandran (2009) apply different semi-supervised graph algorithms (Mincuts, Randomized Mincuts and Label Propagation) to a set of seeds constructed from the General Inquirer. They evaluate the generated lexicons intrinsically, taking the General Inquirer as the gold standard for those words that have a match in the generated lexicons.

In this paper, we describe two methods to automatically generate seeds, either by following Agerri and García-Serrano (2010) or by using Turney and Littman's (2003) seeds. The automatically obtained seeds are then fed into a Personalized PageRank algorithm which is applied over a WordNet projected onto a graph. This method is fully automatic, simple and unsupervised, as it only relies on the availability of a LKB. The overall procedure of our approach consists of two steps: (1) automatically create a set of seeds by iterating over the relations of a LKB (e.g. a WordNet); and (2) use the seeds to initialize contexts to propagate over the LKB graph using a Personalized PageRank algorithm.
The result is qwn-ppv(s): Q-WordNets as Personalized PageRanking Vectors.
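As a rough illustration, the two steps can be sketched over a toy graph. The miniature "WordNet", its relation names and all polarity labels below are our own invention for the example, not part of the actual MCR graphs or of the method's real seed lists; the sketch only mirrors the overall shape of the pipeline.

```python
# Step 1 (toy): iterate over relations from seed synsets, flipping polarity
# across antonymy and keeping it across similarity. RELATIONS is invented data.
RELATIONS = {
    "good": {"similar": ["nice"], "antonym": ["bad"]},
    "nice": {"similar": ["pleasant"], "antonym": []},
    "bad": {"similar": ["awful"], "antonym": ["good"]},
    "awful": {"similar": [], "antonym": []},
    "pleasant": {"similar": [], "antonym": []},
}

def expand_seeds(pos, neg, iterations=2):
    """Collect synsets reachable from the seeds over the toy relations."""
    pos, neg = set(pos), set(neg)
    for _ in range(iterations):
        new_pos, new_neg = set(pos), set(neg)
        for s in pos:
            new_pos.update(RELATIONS[s]["similar"])
            new_neg.update(RELATIONS[s]["antonym"])
        for s in neg:
            new_neg.update(RELATIONS[s]["similar"])
            new_pos.update(RELATIONS[s]["antonym"])
        pos, neg = new_pos, new_neg
    return pos, neg

# Step 2 (toy): personalized PageRank by power iteration,
# Pr = c * M * Pr + (1 - c) * v, with v concentrated on the seeds.
def personalized_pagerank(graph, seeds, c=0.85, iters=30):
    nodes = sorted(graph)
    v = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    pr = dict(v)
    for _ in range(iters):
        # Incoming mass from each neighbour, normalized by its outdegree.
        pr = {n: c * sum(pr[m] / len(graph[m]) for m in graph if n in graph[m])
                 + (1 - c) * v[n]
              for n in nodes}
    return pr

GRAPH = {  # undirected adjacency over similarity links only (G1-style)
    "good": {"nice"}, "nice": {"good", "pleasant"}, "pleasant": {"nice"},
    "bad": {"awful"}, "awful": {"bad"},
}

pos, neg = expand_seeds({"good"}, {"bad"})
pos_rank = personalized_pagerank(GRAPH, pos & set(GRAPH))
neg_rank = personalized_pagerank(GRAPH, neg & set(GRAPH))
# Final lexicon: pol(s) = positive rank minus negative rank; keep s if non-null.
lexicon = {s: "positive" if pos_rank[s] > neg_rank[s] else "negative"
           for s in GRAPH if abs(pos_rank[s] - neg_rank[s]) > 1e-9}
```

Running one propagation per polarity and subtracting the two rankings is what turns the raw PPVs into a binary polarity lexicon.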
We generate seeds by means of two different automatic procedures:

1. AG: We start at the quality synset of WordNet and iterate over WordNet relations following the original Q-WordNet method described in Agerri and García-Serrano (2010).

2. TL: We take a short manually created list of 14 positive and negative words (Turney and Littman, 2003) and iterate over WordNet using five relations: antonymy, similarity, derived-from, pertains-to and also-see.

The AG method starts the propagation from the attributes of the quality synset in WordNet. There are five noun quality senses in WordNet, two of which contain attribute relations (to adjectives). From the first quality noun synset the attribute relation takes us to the adjective synsets positive, negative, good and bad; the second quality noun synset leads to the attributes superior and inferior. The following step is to iterate through every WordNet relation, collecting (i.e., annotating) those synsets that are accessible from the seeds. Both the AG and TL seed generation methods rely on a number of relations to obtain a more balanced POS distribution in the output synsets. The output of both methods is a list of (assumed to be) positive and negative synsets. Depending on the number of iterations performed, a different number of seeds to feed UKB is obtained; seed numbers vary from 100 to 10K synsets. Both seed creation methods can be applied to any WordNet, not only Princeton WordNet, as we show in section 4.

The second and last step to generate qwn-ppv(s) consists of propagating over a WordNet graph to obtain a Personalized PageRanking Vector (PPV), one for each polarity. This step requires:

1. A LKB projected onto a graph.
2. A Personalized PageRank algorithm which is applied over the graph.
3.
Seeds to create contexts to start the propagation, either words or synsets.

Several undirected graphs based on WordNet 3.0 as represented by the MCR 3.0 (Agirre et al., 2012) have been created for the experimentation, corresponding to 4 main sets: (G1) two graphs consisting of every synset linked by the synonymy and antonymy relations; (G2) a graph with the nodes linked by every relation, including glosses; (G3) a graph consisting of the synsets linked by every relation except those linked by antonymy; and finally, (G4) a graph consisting of the nodes related by every relation except the antonymy and gloss relations.

                                        Synset level                            Word level
                                   Positives     Negatives               Positives     Negatives
Lexicon                     size   P   R   F    P   R   F        size    P   R   F    P   R   F
Automatically created
MSOL(ASL-GI)*              32706  .65 .45 .53  .58 .76 .66      76400   .70 .49 .58  .61 .79 .69
QWN                        15508  .69 .53 .60  .62 .76 .68      11693   .64 .53 .58  .60 .70 .65
SWN                        27854  .73 .57 .64  .65 .79 .71      38346   .70 .55 .62  .63 .77 .69
QWN-PPV-AG (s03 G1/w01 G1)  2589  .77 .63 .69  .69 .81 .74       5119   .68 .77 .72  .73 .64 .68
QWN-PPV-TL (s04 G1/w01 G1)  5010  .76 .66 .70  .70 .79 .74        –      –   –   –    –   –   –
(Semi-) manually created
GI*                         2791  .74 .57 .64  .65 .80 .72       3376   .79 .64 .71  .70 .83 .76
OF*                         4640  .77 .61 .68  .68 .81 .74       6860   .82 .71 .76  .74 .84 .79
Liu*                          –    –   –  .71  .70 .85 .76        –      –   –  .79  .77 .87 .82
SO-CAL*                     4212  .75 .57 .64  .65 .81 .72       6226   .82 .70 .76  .74 .85 .79

Table 1: Evaluation of lexicons at document level using Bespalov's corpus ("–": not available).

Using the (G1) graphs, we propagate from the seeds over each type of graph (synonymy and antonymy) to obtain two rankings per polarity. The graphs created in (G2), (G3) and (G4) are used to obtain two rankings, one for each polarity, by propagating from the seeds. In all four cases the different polarity rankings have to be combined in order to obtain a final polarity lexicon: the polarity score pol(s) of a given synset s is computed by adding its scores in the positive rankings and subtracting its scores in the negative rankings. If pol(s) > 0 then s is included in the final lexicon as positive; if pol(s) < 0 then s is included in the final lexicon as negative. We assume that synsets with null polarity scores have no polarity, and consequently they are excluded from the final lexicon.

The Personalized PageRank propagation is performed starting from both synsets and words, and using both the AG and TL styles of seed generation, as explained in section 3.1. Combining the various possibilities produces at least 6 different lexicons per iteration, depending on which decisions are taken about the graph, the seeds and word- vs. synset-level propagation used to create the qwn-ppv(s). In fact, the experiments produced hundreds of lexicons, according to the different iterations for seed generation, but we will only refer to those that obtain the best results in the extrinsic evaluations. (The total time to generate the final 352 QWN-PPV propagations amounted to around two hours of processing on a standard PC.)

With respect to the algorithm used to propagate over the WordNet graph from the automatically created seeds, we use a Personalized PageRank algorithm (Agirre et al., 2014; Agirre and Soroa, 2009). The famous PageRank (Brin and Page, 1998) algorithm is a method to produce a rank of the vertices in a graph according to their relative structural importance. PageRank has also been viewed as the result of a Random Walk process, where the final rank of a given node represents the probability of a random walk over the graph ending on that same node. Thus, if we take the created WordNet graph G with N vertices v1, ..., vN, with di being the outdegree of node i, plus an N × N transition probability matrix M where Mji = 1/di if a link from i to j exists and 0 otherwise, then calculating the PageRank vector Pr over the graph G amounts to solving the following equation:

    Pr = c M Pr + (1 − c) v    (1)

In traditional PageRank, v is a uniform normalized vector whose elements all have value 1/N, which means that all nodes in the graph are assigned the same probability in case of a random walk. Personalizing the PageRank algorithm means making vector v non-uniform, assigning stronger probabilities to certain nodes so that the algorithm propagates the initial importance of those nodes to their vicinity. Following Agirre et al. (2014), in our approach this translates into initializing vector v with the senses obtained by the seed generation methods described above in section 3.1. The initialization of vector v with the seeds thus allows the Personalized propagation to assign greater importance to those synsets in the graph identified as positive or negative, which results in a PPV with the weights skewed towards the nodes initialized/personalized as positive and negative.

Previous approaches have provided intrinsic evaluation (Mohammad et al., 2009; Rao and Ravichandran, 2009;
                                        Synset level                            Word level
                                   Positives     Negatives               Positives     Negatives
Lexicon                     size   P   R   F    P   R   F        size    P   R   F    P   R   F
Automatically created
MSOL(ASL-GI)*              32706  .56 .37 .44  .76 .87 .81      76400   .67 .50 .57  .80 .89 .85
QWN                        15508  .63 .22 .33  .73 .94 .83      11693   .58 .22 .31  .73 .93 .82
SWN                        27854  .57 .33 .42  .75 .89 .81      38346   .55 .55 .55  .80 .80 .80
QWN-PPV-AG (w10 G3/s09 G4) 117485 .60 .63 .62  .83 .82 .83        –      –   –  .59  .81 .88 .84
(Semi-) manually created
GI*                         2791  .70 .32 .44  .76 .94 .84       3376   .71 .56 .62  .82 .90 .86
OF*                         4640  .67 .37 .48  .77 .92 .84        –      –   –  .71  .87 .90 .88
Liu*                        4127  .67 .33 .44  .76 .93 .83       6786   .78 .45 .57  .79 .94 .86
SO-CAL*                     4212  .69 .30 .42  .75 .94 .84       6226   .73 .53 .61  .81 .91 .86

Table 2: Evaluation of lexicons using the average ratio classifier on the MPQA 1.2 test corpus ("–": not available).

Baccianella et al., 2010) using manually annotated resources such as the General Inquirer (Stone et al., 1966) as gold standard. To facilitate comparison, we also provide such an evaluation in section 4.3. Nevertheless, as demonstrated by the results of the extrinsic evaluations, we believe that polarity lexicons should in general be evaluated extrinsically: after all, any polarity lexicon is only as good as the results obtained by using it in a particular Sentiment Analysis task.

Our goal is to evaluate the polarity lexicons while simplifying the evaluation parameters, so as to avoid as many external influences as possible on the results. We compare our work with most of the lexicons reviewed in section 2, both at synset and word level, and both manually and automatically generated: General Inquirer (GI), Opinion Finder (OF), Liu, Taboada et al.'s (SO-CAL), Agerri and García-Serrano's (2010) (QWN), Mohammad et al.'s (MSOL(ASL-GI)) and SentiWordNet 3.0 (SWN). The results presented in section 4.2 show that extrinsic evaluation is more meaningful to determine the adequacy of a polarity lexicon for a specific Sentiment Analysis task.
Three different corpora were used: Bespalov et al.'s (2011) and MPQA (Riloff and Wiebe, 2003) for English, and HOpinion (http://clic.ub.edu/corpus/hopinion) for Spanish. In addition, we divided the MPQA corpus into two subsets (75% development and 25% test) in order to apply our ratio system to the phrase polarity task too. Note that the development set is only used to set up the polarity classification task, and that the generation of qwn-ppv lexicons is unsupervised.

For Spanish we tried to reproduce the English settings used with Bespalov's corpus; thus, both development and test sets were created from the HOpinion corpus. As HOpinion contains a much higher proportion of positive reviews, we also created subsets containing a balanced number of positive and negative reviews, to allow for a more meaningful comparison than that of table 6. Table 3 shows the number of documents per polarity for Bespalov's corpus, MPQA 1.2 and HOpinion.

Corpus       dev     test    Total
Bespalov     771    1,753    2,524
MPQA 1.2     528      528    1,056
HOpinion      –        –       –

Table 3: Number of positive and negative documents in the development and test sets ("–": not available).

We report the results of 4 extrinsic evaluations or tasks, three of them based on a simple average ratio system inspired by Turney (2002), and another one based on Mohammad et al. (2009). We first implemented a simple average ratio classifier, which computes the average ratio of the polarity words found in a document d:

    polarity(d) = ( Σ_{w ∈ d} pol(w) ) / |d|    (2)

where, for each polarity, pol(w) is 1 if w is included in the polarity lexicon and 0 otherwise. Documents that reach a certain threshold are classified as positive, and otherwise as negative. To set up an evaluation environment as fair as possible for every lexicon, the threshold is optimised by maximising accuracy over the development data.

Second, we implemented a phrase polarity identification task as described by Mohammad et al. (2009). Their method consists of: (i) if any of the words in the target phrase is contained in the
                                        Synset level                            Word level
                                   Positives     Negatives               Positives     Negatives
Lexicon                     size   P   R   F    P   R   F        size    P   R   F    P   R   F
Automatically created
MSOL(ASL-GI)*              32706  .52 .48 .50  .85 .62 .71      76400   .68 .56 .62  .82 .86 .84
QWN                        15508  .50 .36 .42  .84 .32 .46      11693   .45 .49 .47  .78 .51 .61
SWN                        27854  .50 .45 .47  .85 .48 .61      38346   .49 .52 .50  .78 .68 .73
QWN-PPV-AG (s09 G3/w02 G3) 117485 .59 .67 .63  .85 .78 .82        –      –   –   –    –   –   –
QWN-PPV-TL (w02 G3/s06 G3) 117485 .59 .57 .58  .82 .81 .81     147194   .63 .67 .65  .85 .81 .83
(Semi-) manually created
GI*                         2791  .60 .40 .47  .91 .38 .54       3376   .70 .60 .65  .93 .52 .67
OF*                         4640  .63 .42 .50  .93 .46 .62        –      –   –  .73  .95 .66 .78
Liu*                        4127  .65 .36 .47  .94 .45 .60       6786   .78 .49 .60  .97 .61 .75
SO-CAL*                     4212  .65 .37 .47  .92 .45 .60       6226   .73 .57 .64  .96 .59 .73

Table 4: Evaluation of lexicons at phrase level using Mohammad et al.'s (2009) method on the full MPQA 1.2 corpus ("–": not available).

negative lexicon, then the polarity is negative; (ii) if none of the words are negative and at least one word is in the positive lexicon, then it is positive; (iii) the rest are not tagged.

We chose these very simple polarity estimators because our aim was to minimize the role other aspects play in the evaluation and to focus on how, other things being equal, polarity lexicons perform in a Sentiment Analysis task. The average ratio classifier is used for the results of tables 1 and 2 (Bespalov and MPQA 1.2 corpora) and tables 5 and 6 (HOpinion), whereas Mohammad et al.'s method is used for the results in table 4. Mohammad et al.'s (2009) testset based on MPQA 1.1 is smaller, but both MPQA 1.1 and 1.2 are hugely skewed towards negative polarity (30% positive vs. 70% negative).

All datasets were POS-tagged and word sense disambiguated using FreeLing (Padró and Stanilovsky, 2012) and UKB (Agirre and Soroa, 2009). Having word sense annotated datasets gives us the opportunity to evaluate the lexicons at both word and sense levels. For the evaluation of lexicons that are synset-based, such as qwn-ppv and SentiWordNet 3.0, we convert them from senses to words by taking every word or variant contained in each of their senses; moreover, if a lemma appears as a variant in several synsets, the most frequent polarity is assigned to that lemma. With respect to lexicons at word level, we take the most frequent sense according to WordNet 3.0 for each of their positive and negative words. Note that the latter conversion, for synset-based evaluation, is mostly done to show that evaluation at synset level is harder independently of the quality of the lexicon evaluated.
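The two polarity estimators described above can be sketched in a few lines. This is a minimal reading of equation (2) and of Mohammad et al.'s rules; the function names and toy lexicons are ours, and pol(w) is folded into a single +1/−1 score per word rather than one ratio per polarity.

```python
def average_ratio(document, pos_lex, neg_lex):
    """Equation (2), combined reading: +1 for positive hits, -1 for negative."""
    score = sum((w in pos_lex) - (w in neg_lex) for w in document)
    return score / len(document)

def classify_document(document, pos_lex, neg_lex, threshold=0.0):
    """Documents at or above the (development-tuned) threshold are positive."""
    ratio = average_ratio(document, pos_lex, neg_lex)
    return "positive" if ratio >= threshold else "negative"

def classify_phrase(phrase, pos_lex, neg_lex):
    """Mohammad et al. (2009) style rules: any negative word wins."""
    if any(w in neg_lex for w in phrase):
        return "negative"
    if any(w in pos_lex for w in phrase):
        return "positive"
    return None  # not tagged

# Toy lexicons and document for illustration only.
POS = {"beautiful", "wonderful", "amazing"}
NEG = {"bad", "awful", "poor"}
doc = "the food was amazing but the service was awful and poor".split()
# One positive and two negative hits over 11 tokens -> negative overall.
```

In the real evaluation the threshold of `classify_document` is optimised on the development set, separately for each lexicon.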
Although tables 1, 2 and 4 also present results at synset level, it should be noted that the only polarity lexicons available to us for comparison at synset level were Q-WordNet (Agerri and García-Serrano, 2010) and SentiWordNet 3.0 (Baccianella et al., 2010).
QWN-PPV-AG refers to the lexicon generated starting from the AG seeds, and QWN-PPV-TL to the one using the TL seeds, as described in section 3.1. Henceforth, we will use qwn-ppv to refer to the overall method presented in this paper, regardless of the seeds used. For every qwn-ppv result reported in this section, we have used every graph described in section 3.2. The configuration of each qwn-ppv in the results specifies which seed iteration is used as the initialization of the Personalized PageRank algorithm, and on which graph. Thus, QWN-PPV-TL (s05 G4) in table 2 means that the 5th iteration of synset seeds was used to propagate over graph G4; if the configuration were (w05 G4), it would mean that the 5th iteration of word seeds was used to propagate over graph G4. The simplicity of our approach allows us to generate many lexicons simply by projecting a LKB over different graphs.

The lexicons marked with an asterisk denote those that have been converted from words to senses using the most frequent sense of WordNet 3.0. We would like to stress again that the purpose of such word-to-synset conversion is to show that SA tasks at synset level are harder than at word level. In addition, it should also be noted that in the case of SO-CAL (Taboada et al., 2010), we have reduced what is a graded lexicon, with scores ranging from 5 to -5, to a binary one.

Table 1 shows that the (at least partially) manually built lexicons obtain the best results on this evaluation. It also shows that qwn-ppv clearly outperforms every other automatically built lexicon. Moreover, manually built lexicons suffer in the evaluation at synset level, most of them obtaining lower scores than qwn-ppv, although Liu's (Hu and Liu, 2004) still obtains the best results. In any case, for an unsupervised procedure, qwn-ppv lexicons obtain very competitive results with respect to manually created lexicons, and the best among the automatic methods.
It should also be noted that the best results of qwn-ppv are obtained with graph G1 and with very few seed iterations. Table 2 again sees the manually built lexicons performing better, although overall the differences with respect to the automatically built lexicons are smaller. Among the latter, qwn-ppv again obtains the best results, both at synset and word level, although at word level the differences with MSOL(ASL-GI) are not large. Finally, table 4 shows that qwn-ppv again outperforms the other automatic approaches and is closer to those that have been (at least partially) manually built. In both MPQA evaluations the best graph overall to propagate the seeds is G3, because this type of task favours high recall.

                              Positives        Negatives
Lexicon               size    P    R    F     P    R    F
Automatically created
SWN                  27854   .87  .99  .93   .70  .16  .27
QWN-PPV-AG (wrd01 G1) 3306   .86 1.00  .92   .67  .01  .02
QWN-PPV-TL (s04 G1)   5010   .89  .96  .93   .58  .30  .39

Table 5: Evaluation of Spanish lexicons using the full HOpinion corpus at synset level.

We report results on the Spanish HOpinion corpus in tables 5 and 6. Mihalcea(f) is a manually revised lexicon based on the automatically built Mihalcea(m) (Pérez-Rosas et al., 2012). ElhPolar (Saralegi and San Vicente, 2013) is semi-automatically built and manually corrected. SO-CAL is built manually. SWN and QWN-PPV have been built via the MCR 3.0's ILI by applying the synset-to-word conversion previously described to the Spanish dictionary of the MCR. The results for Spanish at word level in table 6 show the same trend as for English: qwn-ppv is the best of the automatic approaches and obtains competitive results, although not as good as the best of the manually created lexicons (ElhPolar). Due to the disproportionate number of positive reviews, the results for the negative polarity are not useful to draw any meaningful conclusions. Thus, we also performed an evaluation with the balanced HOpinion set listed in table 3.
                              Positives        Negatives
Lexicon               size    P    R    F     P    R    F
Automatically created
Mihalcea(m)           2496   .86 1.00  .92   .00  .00  .00
SWN                   9712   .88  .97  .92   .55  .19  .28
QWN-PPV-AG (s11 G1)   1926   .89  .97  .93   .59  .26  .36
QWN-PPV-TL (s03 G1)    939   .89  .98  .93   .71  .26  .38
(Semi-) manually created
ElhPolar              4673   .94  .94  .94   .64  .64  .64
Mihalcea(f)           1347   .91  .96  .93   .61  .41  .49
SO-CAL                4664   .92  .96  .94   .70  .51  .59

Table 6: Evaluation of Spanish lexicons using the full HOpinion corpus at word level.

The results with a balanced HOpinion corpus, not shown due to lack of space, also confirm the previous trend: qwn-ppv outperforms the other automatic approaches but is still worse than the best of the manually created ones (ElhPolar).
To facilitate intrinsic comparison with previous approaches, we evaluate our automatically generated lexicons against GI. For each qwn-ppv lexicon shown in the previous extrinsic evaluations, we compute the intersection between the lexicon and GI, and evaluate the words in that intersection. Table 7 shows results for the best-performing QWN-PPV lexicons (using both AG and TL seeds) in the extrinsic evaluations at word level of tables 1 (first two rows), 2 (rows 3 and 4) and 4 (rows 5 and 6). We can see that QWN-PPV lexicons systematically outperform SWN in the number of correct entries. QWN-PPV-TL lexicons obtain 75.04% correct entries on average, and the best-performing lexicon contains up to 81.07% correct entries. Note that we did not compare the results with MSOL(ASL-GI) because it contains the GI.
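The intersection-based intrinsic evaluation can be sketched as follows. The toy lexicons are invented for illustration; the real evaluation uses the actual GI entries as gold standard.

```python
def intersection_accuracy(generated, gold):
    """Accuracy of a generated lexicon on the words it shares with a gold one.

    Both arguments map word -> polarity label. Only words present in both
    lexicons are evaluated, mirroring the intersection-with-GI setup.
    """
    common = generated.keys() & gold.keys()
    if not common:
        return 0.0, 0
    correct = sum(generated[w] == gold[w] for w in common)
    return correct / len(common), len(common)

# Toy data: three shared words, two of which agree with the gold labels.
gold = {"good": "pos", "nice": "pos", "bad": "neg", "awful": "neg"}
generated = {"good": "pos", "bad": "neg", "awful": "pos", "superb": "pos"}
acc, n = intersection_accuracy(generated, gold)  # acc = 2/3 over n = 3 words
```

Note that words outside the intersection ("nice", "superb") contribute nothing, which is precisely why a high intrinsic score need not transfer to an extrinsic SA task.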
QWN-PPV lexicons obtain the best results among the automatic approaches in the evaluations for English and Spanish. Furthermore, across tasks and datasets qwn-ppv shows more consistent and robust behaviour than most of the manually built lexicons, OF apart. The results also show that for a task requiring high recall the larger graphs, e.g. G3, are preferable, whereas for a more balanced dataset and a document-level task the smaller G1 graphs perform better.

Lexicon               ∩ wrt. GI   Acc.   Pos   Neg
SWN                     2,755     .74    .76   .73
QWN-PPV-AG (w01 G1)       849     .71    .68   .75
QWN-PPV-TL (w01 G1)       713     .78    .80   .76
QWN-PPV-AG (s09 G4)     3,328     .75    .75   .77
QWN-PPV-TL (s05 G4)     3,333     .80    .84   .77
QWN-PPV-AG (w02 G3)     3,340     .74    .71   .77
QWN-PPV-TL (s06 G3)     3,340     .77    .79   .77

Table 7: Accuracy of QWN-PPV lexicons and SWN with respect to the GI lexicon.

These are good results considering that our method to generate qwn-ppv is simpler, more robust and more adaptable than previous automatic approaches. Furthermore, although also based on a Personalized PageRank application, it is much simpler than SentiWordNet 3.0, which qwn-ppv consistently outperforms on every evaluation and dataset. The main differences with respect to SentiWordNet's approach are the following: (i) the seed generation and the training of 7 supervised classifiers correspond in qwn-ppv to only one simple step, namely, the automatic generation of seeds as explained in section 3.1; (ii) the generation of qwn-ppv only requires a LKB's graph for the Personalized PageRank propagation, with no disambiguated glosses; (iii) the graph they use for the propagation also depends on disambiguated glosses, which are not readily available for every language.

The fact that qwn-ppv is based on already available WordNets projected onto simple graphs is crucial for the robustness and adaptability of the qwn-ppv method across evaluation tasks and datasets: our method can quickly create, over different graphs, many lexicons of different sizes, which can then be evaluated on a particular polarity classification task and dataset. Hence the different configurations of the qwn-ppv lexicons: for some tasks a G3 graph with more AG/TL seed iterations will obtain better recall, and vice versa. This is confirmed by the results: the tasks using MPQA seem to clearly benefit from high recall, whereas the scores on Bespalov's corpus are, overall, more balanced.
This could also be due to the size of Bespalov's corpus, almost 10 times larger than MPQA 1.2.

The experiments to generate Spanish lexicons confirm the trend shown by the English evaluations: lexicons generated by qwn-ppv consistently outperform other automatic approaches, although some manual lexicon is better on a given task and dataset (usually a different one each time). Nonetheless, the Spanish evaluation shows that our method is also robust across languages, as it gets quite close to the manually corrected lexicon Mihalcea(f) (Pérez-Rosas et al., 2012).

The results also confirm that no single lexicon is the most appropriate for every SA task, dataset and domain. In this sense, the adaptability of qwn-ppv is a desirable feature for lexicons to be employed in SA tasks: the unsupervised qwn-ppv method only relies on the availability of a LKB to build hundreds of polarity lexicons, which can then be evaluated on a given task and dataset to choose the best fit. If no annotated evaluation set is available, G3-based propagations provide the best recall, whereas the G1-based lexicons are less noisy. Finally, we believe that the results reported here point to the fact that intrinsic evaluations are not meaningful to judge the adequacy of a polarity lexicon for a specific SA task.
This paper presents an (almost) unsupervised dictionary-based method, qwn-ppv, to automatically generate polarity lexicons. Although simpler than similar automatic approaches, it still obtains better results on the four extrinsic evaluations presented. Because it only depends on the availability of a LKB, we believe that this method can be valuable to generate on-demand polarity lexicons for a given language when not enough annotated data is available. We demonstrate the adaptability of our approach by producing well-performing polarity lexicons for different evaluation scenarios and for more than one language.

Further work includes investigating different graph projections of WordNet relations to do the propagation, as well as exploiting synset weights. We also plan to investigate the use of annotated corpora to generate lexicons at word level to try and close the gap with those that have been (at least partially) manually annotated.

The qwn-ppv lexicons and graphs used in this paper are publicly available (under CC-BY license): . The qwn-ppv tool to automatically generate polarity lexicons given a WordNet in any language will soon be available at the aforementioned URL.

Acknowledgements
This work has been supported by the OpeNER FP7 project under Grant No. 296451, the FP7 NewsReader project, Grant No. 316404, and by the Spanish MICINN project SKATER under Grant No. TIN2012-38584-C06-01.
References

[Agerri and García-Serrano 2010] R. Agerri and A. García-Serrano. 2010. Q-WordNet: Extracting polarity from WordNet senses. In Seventh Conference on International Language Resources and Evaluation (LREC 2010), Malta.

[Agirre and Soroa 2009] Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2009), Athens, Greece.

[Agirre et al. 2012] Aitor González Agirre, Egoitz Laparra, and German Rigau. 2012. Multilingual Central Repository version 3.0: Upgrading a very large lexical knowledge base. In GWC 2012: 6th International Global Wordnet Conference, page 118.

[Agirre et al. 2014] Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics, (Early Access).

[Baccianella et al. 2010] S. Baccianella, A. Esuli, and F. Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Seventh Conference on International Language Resources and Evaluation (LREC 2010), Malta.

[Bespalov et al. 2011] Dmitriy Bespalov, Bing Bai, Yanjun Qi, and Ali Shokoufandeh. 2011. Sentiment classification based on supervised latent n-gram analysis. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 375–382.

[Bond and Foster 2013] Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet.

[Brin and Page 1998] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117.

[Choi and Cardie 2009] Y. Choi and C. Cardie. 2009. Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 590–598.

[Ding et al. 2008] X. Ding, B. Liu, and P. S. Yu. 2008. A holistic lexicon-based approach to opinion mining. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 231–240.

[Esuli and Sebastiani 2007] Andrea Esuli and Fabrizio Sebastiani. 2007. PageRanking WordNet synsets: An application to opinion mining. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 424–431, Prague, Czech Republic, June.

[Fellbaum and Miller 1998] C. Fellbaum and G. Miller, editors. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge (MA).

[Hatzivassiloglou and McKeown 1997] V. Hatzivassiloglou and K. R. McKeown. 1997. Predicting the semantic orientation of adjectives. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 174–181.

[Hu and Liu 2004] M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

[Kim and Hovy 2004] Soo-Min Kim and Eduard Hovy. 2004. Determining the sentiment of opinions. In Proceedings of COLING 2004, pages 1367–1373, Geneva, Switzerland, August.

[Liu 2012] Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167.

[Mihalcea et al. 2007] R. Mihalcea, C. Banea, and J. Wiebe. 2007. Learning multilingual subjective language via cross-lingual projections. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 976.

[Mohammad et al. 2009] S. Mohammad, C. Dunne, and B. Dorr. 2009. Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 599–608.

[Padró and Stanilovsky 2012] Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

[Pang and Lee 2008] B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

[Pérez-Rosas et al. 2012] Verónica Pérez-Rosas, Carmen Banea, and Rada Mihalcea. 2012. Learning sentiment lexicons in Spanish. In LREC, pages 3077–3081.

[Rao and Ravichandran 2009] D. Rao and D. Ravichandran. 2009. Semi-supervised polarity lexicon induction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 675–682.

[Riloff and Wiebe 2003] E. Riloff and J. Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'03).

[Saralegi and San Vicente 2013] Xabier Saralegi and Iñaki San Vicente. 2013. Elhuyar at TASS 2013. In XXIX Congreso de la Sociedad Española de Procesamiento del Lenguaje Natural, Workshop on Sentiment Analysis at SEPLN (TASS 2013), pages 143–150, Madrid.

[Stone et al. 1966] P. Stone, D. Dunphy, M. Smith, and D. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press, Cambridge (MA).

[Strapparava and Valitutti 2004] Carlo Strapparava and Alessandro Valitutti. 2004. WordNet-Affect: An affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pages 1083–1086, Lisbon, May.

[Taboada et al. 2010] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede. 2010. Lexicon-based methods for sentiment analysis. Computational Linguistics, (Early Access).

[Takamura et al. 2005] Hiroya Takamura, Takashi Inui, and Manabu Okumura. 2005. Extracting semantic orientations of words using spin model. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 133–140, Ann Arbor, Michigan, June.

[Turney and Littman 2003] P. Turney and M. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 21(4):315–346.

[Turney 2002] P. D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417–424.

[Wilson et al. 2005] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In