Identifying Purpose Behind Electoral Tweets
Saif M. Mohammad, Svetlana Kiritchenko, and Joel Martin
National Research Council Canada
Ottawa, Ontario, Canada K1A 0R6
{saif.mohammad,svetlana.kiritchenko,joel.martin}@nrc-cnrc.gc.ca
ABSTRACT
Tweets pertaining to a single event, such as a national election, can number in the hundreds of millions. Automatically analyzing them is beneficial in many downstream natural language applications such as question answering and summarization. In this paper, we propose a new task: identifying the purpose behind electoral tweets, that is, why people post election-oriented tweets. We show that identifying purpose is correlated with the related phenomena of sentiment and emotion detection, yet significantly different from them. Detecting purpose has a number of applications, including detecting the mood of the electorate, estimating the popularity of policies, identifying key issues of contention, and predicting the course of events. We create a large dataset of electoral tweets and annotate a few thousand of them for purpose. We develop a system that automatically classifies electoral tweets as per their purpose, obtaining an accuracy of 43.56% on an 11-class task and an accuracy of 73.91% on a 3-class task (both accuracies well above the most-frequent-class baseline). Finally, we show that resources developed for emotion detection are also helpful for detecting purpose.
1. INTRODUCTION
The number of tweets pertaining to a single event or topic, such as a national election, a natural disaster, or gun control laws, can grow to the hundreds of millions. The large number of tweets negates the possibility of a single person reading all of them to gain an overall global perspective. Thus, automatically analyzing tweets is beneficial in many downstream natural language applications such as question answering and summarization.

An important facet in understanding tweets is the question of 'Why?', that is, what is the purpose of the tweet? There has been some prior work in this regard [1, 23, 34]; however, it has focused on the general motivations and reasons for tweeting. For example, Naaman et al. [23] proposed the categories of information sharing, self promotion, opinions, statements, me now, questions, presence maintenance, anecdote (me), and anecdote (others). On the other hand, the dominant reasons for tweeting vary when tweeting about specific topics and events. For example, the reasons for tweeting in national elections are very different from the reasons for tweeting during a natural disaster, such as an earthquake.

There is growing interest in analyzing political tweets in particular, because of a number of applications such as determining the political alignment of tweeters [14, 10], identifying contentious issues and political opinions [20], detecting the amount of polarization in the electorate [11], and so on. There is even a body of work claiming that analyzing political tweets can help predict the outcome of elections [4, 38]. However, that claim is questioned by more recent work [2].

In this paper, we propose the task of identifying the purpose behind electoral tweets. For example, some tweets are meant to criticize, some to praise, some to express disagreement, and so on. Determining the purpose behind electoral tweets can help many applications such as those listed above. There are many reasons why people criticize, praise, etc., but investigating those reasons is beyond the scope of this paper. For discussions on user satisfaction from tweets, we refer the reader to Liu, Cheung, and Lee [19] and Cheung and Lee [7].

First, we automatically compile a dataset of electoral tweets using a few hand-chosen hashtags. We choose the 2012 US presidential elections as our target domain. We develop a questionnaire to annotate tweets for purpose by crowdsourcing. We analyze the annotations to determine the distributions of different kinds of purpose. We develop a preliminary system that automatically classifies electoral tweets as per their purpose, using various features that have traditionally been used in tweet classification, such as word ngrams and elongated words, as well as features pertaining to eight basic emotions. We show that resources developed for emotion detection are also helpful for detecting purpose.

We then add to this system features pertaining to hundreds of fine emotion categories. We show that these features lead to significant improvements in accuracy above and beyond those obtained by the competitive preliminary system. The system obtains an accuracy of 43.56% on an 11-class task and an accuracy of 73.91% on a 3-class task.

Finally, we show that emotion detection alone can fail to distinguish between several different types of purpose. For example, the same emotion of disgust can be associated with many different kinds of purpose, such as 'to criticize', 'to vent', and 'to ridicule'. Thus, detecting purpose provides information that is not provided simply by detecting sentiment or emotion.
We publicly release all the data created as part of this project: about 1 million original tweets on the 2012 US elections, about 2,000 tweets annotated for purpose, about 1,200 tweets annotated for emotion, and the new emotion lexicon.

We begin with related work (Section 2). We then describe how we collected (Section 3.1) and annotated the data (Section 3.2). Section 3.3 gives an analysis of the annotations, including distributions of various kinds of purpose, inter-annotator agreement, and confusion matrices. In Section 3.4, we tease out the partial correlation and the distinction between purpose and affect. In Section 4, we first present a basic system to classify tweets by purpose (Section 4.1), and then describe how we created an emotion resource pertaining to hundreds of emotions and used it to further improve the performance of the basic system (Section 4.2). In Section 5, we discuss some of the findings of the automatic classifiers, further delineate the relation between purpose detection and emotion detection, and present concluding remarks.
2. RELATED WORK
There exists considerable work on tweet classification by topic [33, 18, 24]. Some of the classification work that comes close to identifying purpose is described below. Alhadi et al. [1] annotated 1000 tweets into the categories of social interaction with people, promotion or marketing, share resources, give or require feedback, broadcast alert/urgent information, require/raise funding, recruit worker, and express emotions. Naaman et al. [23] organized 3379 tweets into the categories of information sharing, self promotion, opinions, statements, me now, questions, presence maintenance, anecdote (me), and anecdote (others). Sankaranarayanan et al. [34] built a system to identify tweets pertaining to breaking news. Sriram et al. [35] annotated 5407 tweets into news, events, opinions, deals, and private messages.

Tweet categorization work within a particular domain includes that by Collier, Son, and Nguyen [9], where flu-related tweets were classified into avoidance behavior, increased sanitation, seeking pharmaceutical intervention, wearing a mask, and self-reported diagnosis, and work by Caragea et al. [5], where earthquake-related tweets were classified into medical emergency, people trapped, food shortage, water shortage, water sanitation, shelter needed, collapsed structure, food distribution, hospital/clinic services, and person news.

To the best of our knowledge, there is no work yet on classifying electoral or political tweets into sub-categories. As mentioned earlier, there exists work on determining the political alignment of tweeters [14, 10], identifying contentious issues and political opinions [20], detecting the amount of polarization in the electorate [11], and detecting sentiment in political tweets [4, 8, 25].

Sentiment classification of general (non-domain) tweets has received much attention [26, 15, 17].
Beyond simply positive and negative sentiment, some recent work also classifies tweets into emotions [16, 21, 31, 37]. Much of this work focused on emotions argued to be the most basic. For example, Ekman [12] proposed six basic emotions: joy, sadness, anger, fear, disgust, and surprise. Plutchik [30] argued in favor of eight: Ekman's six plus trust and anticipation. There is less work on complex emotions, such as the work by Pearl and Steyvers [29] that focused on politeness, rudeness, embarrassment, formality, persuasion, deception, confidence, and disbelief.

Many of the automatic emotion classification systems use affect lexicons such as the NRC Emotion Lexicon [22], WordNet Affect [36], and the Affective Norms for English Words (http://csea.phhp.ufl.edu/media/anewmessage.html). Affect lexicons are lists of words and associated emotions and sentiments. We will show that affect lexicons are helpful for detecting the purpose behind tweets as well.
3. DATA COLLECTION AND ANNOTATION OF PURPOSE
In the subsections below, we describe how we collected tweets posted during the run-up to the 2012 US presidential elections and how we annotated them for purpose by crowdsourcing.
3.1 Data Collection

We created a corpus of tweets by polling the Twitter Search API, during August and September 2012, for tweets that contained commonly known hashtags pertaining to the 2012 US presidential elections. Table 1 shows the query terms we used. Apart from 21 hashtags, we also collected tweets with the words Obama, Barack, or Romney. We used these additional terms because they were the names of the two presidential candidates. Further, the probability that these words were used to refer to someone other than the presidential candidates was low.

Table 1: Query terms used to collect tweets pertaining to the 2012 US presidential elections. [Table body not recoverable.]

The Twitter Search API was polled every four hours to obtain new tweets that matched the query. Close to one million tweets were collected, which we will make freely available to the research community. Note that Twitter imposes restrictions on the direct distribution of tweets, but allows the distribution of tweet ids. One may download tweets using tweet ids and third-party tools, provided those tweets have not been deleted by the people who posted them. The query terms that produced the highest number of tweets were those involving the names of the presidential candidates.
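The collection procedure amounts to a periodic polling loop with deduplication by tweet id. Below is a minimal Python sketch of such a loop; the search() helper, the query list shown, and the output format are our own illustrative assumptions (the paper does not give implementation details, and the 2012-era Twitter endpoints are no longer available).

```python
import json
import time

# Hypothetical helper: wraps whatever Twitter search endpoint is available
# and returns a list of {"id": ..., "text": ...} dicts for a query string.
def search(query):
    return []  # replace with a real API call; returns [] so the sketch runs

# Illustrative queries only; the paper's full list is in Table 1.
QUERIES = ["Obama", "Barack", "Romney", "#election2012"]

def poll_once(seen_ids, out_path="electoral_tweets.jsonl"):
    """One polling pass: fetch tweets for every query, keep only new ids."""
    with open(out_path, "a", encoding="utf-8") as out:
        for query in QUERIES:
            for tweet in search(query):
                if tweet["id"] not in seen_ids:  # dedupe across polls/queries
                    seen_ids.add(tweet["id"])
                    out.write(json.dumps(tweet) + "\n")

if __name__ == "__main__":
    seen = set()
    while True:
        poll_once(seen)
        time.sleep(4 * 60 * 60)  # the paper polled every four hours
```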
3.2 Annotating Purpose

We used Amazon's Mechanical Turk service to crowdsource the annotation of the electoral tweets. We randomly selected about 2,000 tweets, each by a different Twitter user. We asked a series of questions for each tweet. Below is the questionnaire for an example tweet:

Purpose behind US election tweets

Tweet: Mitt Romney is arrogant as hell.

Q1. Which of the following best describes the purpose of this tweet?
- to point out hypocrisy or inconsistency
- to point out mistake or blunder
- to disagree
- to ridicule
- to criticize, but none of the above
- to vent
- to agree
- to praise, admire, or appreciate
- to support
- to provide information without emotion
- none of the above

Q2. Is this tweet about US politics and elections?
• Yes, this tweet is about US politics and elections.
• No, this tweet has nothing to do with US politics or anybody involved in it.
These questionnaires are called HITs (human intelligence tasks) in Mechanical Turk parlance. We posted 2042 HITs corresponding to 2042 tweets. We requested responses from at least three annotators for each HIT. The response to a HIT by an annotator is called an assignment. In Mechanical Turk, an annotator may provide assignments for as many HITs as they wish. Thus, even though only three annotations were requested per HIT, about 400 annotators contributed assignments for the 2,042 tweets. The number of assignments completed by the annotators followed a Zipfian distribution.

Even though it is possible that more than one option may apply to a tweet, we allowed the Turkers to select only one option for each question. We did this to encourage annotators to select the option that best answers the question. We wanted to avoid situations where an annotator selects multiple options just because they are vaguely relevant to the question.

The eleven options of Q1 can be grouped into three coarse categories: oppose (to point out hypocrisy, to point out mistake, to disagree, to ridicule, to criticize, to vent), favour (to agree, to praise, to support), and other (to provide information, none of the above). Even though there is some redundancy among the fine categories, they are more precise and may help annotation. Eventually, however, it may be beneficial to combine two or more categories for the purposes of automatic classification. The amount of combining will depend on the task at hand, and can be done to the extent that anywhere from eleven to two categories remain.

3.3 Analysis of Annotations

The Mechanical Turk annotations were done over a period of one week. For each annotator and for each question, we calculated the probability with which the annotator agrees with the response chosen by the majority of the annotators. We identified poor annotators as those whose agreement probability was more than two standard deviations away from the mean. All annotations by these annotators were discarded. Table 2 gives a histogram of the number of annotations of the remaining tweets. There were 1121 tweets with exactly three annotations.

Table 2: The histogram of the number of annotations of tweets ('annotns' is short for annotations). [Table body not recoverable.]

We determined whether a tweet is to be assigned a particular category based on strong majority. That is, a tweet belongs to category X if it is annotated with X more often than with all other categories combined. The percentage of tweets in each of the 11 categories of Q1 is shown in Table 3. Observe that the majority category for purpose is 'to support': 26.49% of the tweets were identified as having the purpose 'to support'. Table 4 gives the distributions of the three coarse categories of purpose. Observe that the political tweets express disagreement (58.07%) much more than support (31.76%).

Table 5 gives the distributions for question 2. Observe that a large majority (95.56%) of the tweets are relevant to US politics and elections. This shows that the hashtags shown earlier in Table 1 are effective in identifying political tweets.
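The strong-majority rule and the annotator filter described above are easy to make concrete. The following sketch assumes the assignments are available as a list of (annotator, tweet_id, label) triples; the data layout and function names are ours, not the paper's code.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

def strong_majority_label(labels):
    """Label chosen more often than all other labels combined, else None."""
    top, count = Counter(labels).most_common(1)[0]
    return top if count > len(labels) - count else None

def filter_poor_annotators(assignments):
    """assignments: list of (annotator, tweet_id, label) triples.
    Drops annotators whose probability of agreeing with the majority
    response is more than two standard deviations from the mean."""
    by_tweet = defaultdict(list)
    for annotator, tweet_id, label in assignments:
        by_tweet[tweet_id].append((annotator, label))

    agreements = defaultdict(list)  # annotator -> 0/1 agreement indicators
    for responses in by_tweet.values():
        labels = [lab for _, lab in responses]
        for annotator, label in responses:
            # agreed if at least one other annotator chose the same label
            agreements[annotator].append(int(labels.count(label) >= 2))

    probs = {a: mean(v) for a, v in agreements.items()}
    mu, sigma = mean(probs.values()), stdev(probs.values())
    keep = {a for a, p in probs.items() if abs(p - mu) <= 2 * sigma}
    return [t for t in assignments if t[0] in keep]
```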
Table 3: Percentage of tweets in each of the eleven categories of Q1. Only those tweets that were annotated by at least two annotators were included. A tweet belongs to category X if it is annotated with X more often than with all other categories combined. There were 1072 such tweets in total.

  Purpose of tweet                              Percentage of tweets
  favour
    to agree                                           0.47
    to praise, admire, or appreciate                  15.02
    to support                                        26.49
  oppose
    to point out hypocrisy or inconsistency            7.00
    to point out mistake or blunder                    3.45
    to disagree                                        2.52
    to ridicule                                       15.39
    to criticize, but none of the above                7.09
    to vent                                            8.21
  other
    to provide information without any
      emotional content                               13.34
    none of the above                                  1.03
  all                                                100.0

Table 4: Percentage of tweets in each of the three coarse categories of Q1. Only those tweets that were annotated by at least two annotators were included. A tweet belongs to category X if it is annotated with X more often than with all other categories combined. There were 1672 such tweets in total. Agreement on the 3 categories is higher than on the 11 categories.

  Category      Percentage of tweets
  oppose               58.07
  favour               31.76
  other                10.17
  all                 100.0

We calculated agreement on the full set of annotations, and not just on the annotations with a strong majority as described in the previous section. One way to gauge the amount of agreement among annotators is to examine the number of times all three annotators agree (majority class size = 3), the number of times two out of three annotators agree (majority class size = 2), and the number of times all three annotators choose different options (majority class size = 1). Table 6 gives the distributions of the majority classes. Higher numbers for the larger class sizes indicate higher agreement. For example, for 22.4% of the tweets, all three annotators gave the same answer for question 1 (Q1). The agreement is much higher if one considers only the coarse categories of 'oppose', 'favour', and 'other'; these numbers are shown in the row marked Q1'. The agreement for question 2 was substantially high. This was expected, as it is a relatively straightforward question. The numbers in the table are calculated from tweets with exactly three annotations.

Table 7 shows the inter-annotator agreement (IAA) for the two questions: the average percentage of times two annotators agree with each other. IAA gives us an understanding of the degree of agreement through a single number. Observe that the agreement is only moderate for the eleven fine categories of purpose (43.58%), but much higher when considering the coarser categories (83.81%).
Table 5: Percentage of tweets in each of the two categories of Q2.

  Relevance                                       Percentage of tweets
  pertaining to US politics and elections               95.56
  not pertaining to US politics and elections            4.44
  all                                                  100.0
Table 6: Percentage of tweets having majority class size (MCS) of 1, 2, and 3. Q is short for question.

        MCS-1   MCS-2   MCS-3
  Q1     29.5    48.1    22.4
  Q1'     2.2    31.7    66.1
  Q2      0.0     5.7    94.3

Another way to gauge agreement is by calculating the average probability with which an annotator picks the majority class. Consider the example below. Each tweet is annotated by 3 different annotators. X annotates 10 tweets. Six of those times, X's answer for Q1 is the answer that has a majority (in the case of 3 annotators, this means that at least one other annotator also gave the same answer as X for 6 of the 10 tweets). Thus, the probability with which X picks the majority class is 6/10. The last column in Table 7 shows the average probability of picking the majority class (APMS) for the annotators (higher numbers indicate higher agreement). Overall, we observe that there is strong agreement between annotators at identifying whether the purpose of a tweet is to oppose, to favour, or something else.

Table 7: Agreement statistics: inter-annotator agreement (IAA) and average probability of choosing the majority class (APMS).

         IAA     APMS
  Q1    43.58    0.520
  Q1'   83.81    0.855
  Q2    96.76    0.974
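Both statistics can be computed directly from the per-tweet label lists. A small sketch (the naming is ours; it assumes exactly the per-tweet lists of three labels described above):

```python
from collections import Counter
from itertools import combinations
from statistics import mean

def majority_class_size(labels):
    """Size of the largest group of agreeing annotators for one tweet."""
    return Counter(labels).most_common(1)[0][1]

def iaa(per_tweet_labels):
    """Average percentage of annotator pairs that agree (IAA in Table 7)."""
    scores = [mean(a == b for a, b in combinations(labels, 2))
              for labels in per_tweet_labels]
    return 100 * mean(scores)

# Toy example with three annotators per tweet:
tweets = [["oppose", "oppose", "favour"],
          ["other", "other", "other"],
          ["favour", "oppose", "other"]]
print(Counter(majority_class_size(ls) for ls in tweets))  # MCS histogram
print(f"IAA = {iaa(tweets):.2f}%")
```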
Human annotators may disagree with each other because two or more options may seem appropriate for a given tweet. There also exist tweets where the purpose is unclear. Table 8 shows the confusion matrix for question 1. The rows and columns of the matrix correspond to the eleven options. The value in a particular cell, say for row x and column y, is the number of annotations that were assigned label y even though the majority votes for each of those tweets were for x. The highest number in each row is shown in bold. The cells in the diagonal correspond to the number of instances for which the annotations matched the majority vote. For high agreement, one would want higher numbers in the diagonal, which is what we observe in Table 8.

Table 8: Confusion matrix for Question 1 (fine-grained). The value in the cell for row x and column y is the number of annotations that were assigned label y even though the majority votes for each of those tweets were for x. The highest number in each row is shown in bold. [Matrix values not recoverable.]

We can identify options that tend to be confused for each other by noting non-diagonal cells with high values. For example, consider cell r7-c8. The relatively large number indicates that 'to ridicule' is often confused with 'to criticize, but none of the above'. Similarly, we find that 'to point out hypocrisy or inconsistency' and 'to point out mistake or blunder' are also often confused with 'to criticize, but none of the above' (r4-c8 and r5-c8). This suggests that the purpose 'to criticize, but none of the above' is relatively harder to identify. Note, however, that the labels are not confused as strongly in the other direction. That is, tweets that have a purpose of 'to criticize' are not confused as much with 'to point out hypocrisy or inconsistency' (r8-c4), 'to point out mistake or blunder' (r8-c5), or 'to ridicule' (r8-c7). Thus there is some clear signal even in 'to criticize, but none of the above' that human annotators are able to exploit. This suggests that the category 'to criticize, but none of the above' serves as a hold-back for the other finer-grained categories of 'oppose' and, therefore, is often chosen by annotators for less clear messages. A similar situation occurs in the 'favour' group, where the confusion occurs mostly between the more general category 'to support' and the more specific categories 'to agree' and 'to praise, admire, or appreciate'.

Note that in a particular application, one may choose only the subset of the eleven categories that is most relevant. For example, one may combine 'to point out hypocrisy or inconsistency', 'to point out mistake or blunder', and 'to criticize, but none of the above' into a single category, and distinguish it from other oppose categories such as 'to disagree' and 'to ridicule'.

Table 9 shows the confusion matrix within the coarse categories of question 1. The confusion between the coarse categories is relatively lower than among the finer categories, but there still exist instances where 'favour' is confused with 'oppose', and vice versa. Table 10 shows the confusion matrix for question 2. Only a very small number of instances are confused with the wrong option for this question.
Table 9: Confusion matrix for Question 1' (coarse-grained). [Matrix values not recoverable.]

Table 10: Confusion matrix for Question 2. [Matrix values not recoverable.]
3.4 Purpose and Emotion

The task of detecting purpose is related to sentiment and emotion classification. Intuitively, the three broad categories of purpose, 'oppose', 'favour', and 'other', roughly correspond to negative, positive, and objective sentiment. Also, some fine-grained categories seem to partially correlate with emotions. For example, when angry, a person vents. When overcome with admiration, a person praises the object of admiration. In our experiments, we showed that resources created for emotion detection helped identify purpose.

To further investigate the relation between purpose and emotion, we annotated a portion of the tweets by crowdsourcing with one of 19 emotions: acceptance, admiration, amazement, anger, anticipation, calmness, disappointment, disgust, dislike, fear, hate, indifference, joy, like, sadness, surprise, trust, uncertainty, and vigilance. As with the annotation of purpose, each tweet was annotated by at least two judges, and tweets with no strong majority were discarded. Table 11 shows the percentage of tweets pertaining to different emotions; only high-frequency categories of purpose and emotion are shown.

Table 11: Percentage of different purpose tweets pertaining to different emotions (columns: admiration, anticipation, joy, dislike, disappointment, disgust, anger). Low-frequency categories of purpose and emotion are omitted. The highest number for each category of purpose is shown in bold. [Table values not recoverable.]

As expected, the tweets with the purpose 'favour' mainly convey the emotions of admiration, anticipation, and joy. On the other hand, the tweets with the purpose 'oppose' are mostly associated with negative emotions such as dislike, anger, and disgust. The purpose 'to praise, admire, or appreciate' is highly correlated with the emotion admiration.

Note that most of the tweets with the purposes 'to point out hypocrisy', 'to point out mistake', 'to disagree', 'to ridicule', 'to criticize', and even many instances of 'to vent', are associated with the emotion dislike. Thus, a system that only determines emotion and not purpose will fail to distinguish between these different categories of purpose. It is possible for people to have the same emotion of dislike and react differently: by just disagreeing, pointing out the mistake, criticizing, or resorting to ridicule.
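A cross-tabulation like Table 11 can be reproduced from the two sets of gold labels. Below is a sketch; the row-normalization to percentages is our assumption about how the table was computed.

```python
from collections import Counter

def purpose_emotion_table(purpose, emotion):
    """purpose, emotion: dicts mapping tweet_id -> gold label.
    Returns {(purpose, emotion): percentage}, normalized per purpose row."""
    pairs = Counter((purpose[t], emotion[t])
                    for t in purpose.keys() & emotion.keys())
    row_totals = Counter()
    for (p, _), n in pairs.items():
        row_totals[p] += n
    return {(p, e): 100.0 * n / row_totals[p] for (p, e), n in pairs.items()}

# Toy example:
purpose = {1: "to vent", 2: "to vent", 3: "to praise"}
emotion = {1: "dislike", 2: "anger", 3: "admiration"}
print(purpose_emotion_table(purpose, emotion))
```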
4. AUTOMATICALLY IDENTIFYING PURPOSE
4.1 A Basic System

To automatically classify tweets into the eleven categories of purpose, we trained a Support Vector Machine (SVM) classifier. The SVM is a state-of-the-art learning algorithm that has proven effective on text categorization tasks and robust on large feature spaces. The eleven categories were assumed to be mutually exclusive, i.e., each tweet was classified into exactly one category. In a second set of experiments, the eleven fine-grained categories were combined into the 3 coarse-grained categories ('oppose', 'favour', and 'other') described earlier. In each experiment, ten-fold stratified cross-validation was repeated ten times, and the results were averaged. A paired t-test was used to confirm the significance of the results. We used the LibSVM package [6] with a linear kernel; the parameter C was chosen by cross-validation on the training portion of the data (i.e., the nine training folds), and the remaining parameters were left at their default settings.

The gold labels were determined by strong majority voting. Tweets with fewer than 2 annotations or with no majority label were discarded. Thus, the dataset consisted of 1072 tweets for the 11-category task and 1672 tweets for the 3-category task. The tweets were normalized by replacing all URLs with http://someurl and all userids with @someuser. The tweets were tokenized and tagged with parts of speech using the Carnegie Mellon University Twitter NLP tool [13].
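A minimal sketch of this evaluation setup in scikit-learn, substituting its libsvm-backed SVC for the LibSVM tools and simple word n-grams for the full feature set described next; all names here are our own, not the paper's code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def evaluate(texts, labels):
    """10x10-fold stratified CV; C tuned on the training folds only."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 4), binary=True),
        GridSearchCV(SVC(kernel="linear"),
                     param_grid={"C": [0.01, 0.1, 1, 10]}, cv=3),
    )
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    return np.mean(scores)

# texts: list of normalized tweets; labels: one of the 11 purpose categories.
# print(evaluate(texts, labels))
```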
Each tweet was represented as a feature vector with the following groups of features, drawn from prior work on social media and sentiment analysis [27, 3, 32]. We employed commonly used text classification features such as ngrams, part of speech, and punctuation, as well as common Twitter-specific features such as emoticons and hashtags. Additionally, we hypothesized that the purpose of a tweet is guided by the emotions of the tweeter, and thus explored certain emotion features as well. (A sketch of the negated-context marking described below appears at the end of this subsection.)

• n-grams: presence of n-grams (contiguous sequences of 1, 2, 3, and 4 tokens), skipped n-grams (n-grams with one token replaced by *), and character n-grams (contiguous sequences of 3, 4, and 5 characters);
• POS: the number of occurrences of each part of speech;
• word clusters: presence of words from each of the 1000 word clusters provided by the Twitter NLP tool [13]. These clusters were produced with the Brown clustering algorithm on 56 million English-language tweets. They serve as an alternative representation of tweet content, reducing the sparsity of the token space;
• all-caps: the number of words with all characters in upper case;
• NRC Emotion Lexicon: we used the NRC Emotion Lexicon [22] to incorporate affect features. The lexicon consists of 14,182 words manually annotated with 8 basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and 2 polarities (positive, negative). Each word can have zero, one, or more associated emotions and zero or one polarity. From the lexicon we derived:
  – the number of words associated with each emotion;
  – the number of nouns, verbs, etc., associated with each emotion;
  – the number of all-caps words associated with each emotion;
  – the number of hashtags associated with each emotion;
• negation: the number of negated contexts. Following [28], we defined a negated context as a segment of a tweet that starts with a negation word (e.g., 'no', 'shouldn't') and ends with one of the punctuation marks ',', '.', ':', ';', '!', '?'. A negated context affects the n-gram and Emotion Lexicon features: each word, and each emotion associated with it, in a negated context becomes negated (e.g., 'not perfect' becomes 'not perfect_NEG', 'EMOTION_trust' becomes 'EMOTION_trust_NEG'). The list of negation words was adopted from Christopher Potts' sentiment tutorial (http://sentiment.christopherpotts.net/lingstruc.html);
• punctuation: the number of contiguous sequences of exclamation marks, question marks, and both exclamation and question marks;
• emoticons: presence/absence of positive and negative emoticons. The polarity of an emoticon was determined with a simple regular expression adopted from Christopher Potts' tokenizing script (http://sentiment.christopherpotts.net/tokenizing.html);
• hashtags: the number of hashtags;
• elongated words: the number of words with one character repeated more than 2 times, e.g., 'soooo'.

Table 12 presents the results of the automatic classification for the 11-category and 3-category problems. For comparison, we also provide the accuracy of a simple baseline classifier that always predicts the majority class.

Table 12: Accuracy of the automatic classification on the 11-category and 3-category problems. The lower bound is the percentage of the majority class.

                   11-class   3-class
  majority class     26.49     58.07
  SVM                43.56     73.91

Table 13 shows the classification results broken down by category. As expected, the categories with larger numbers of labeled examples ('to praise', 'to support', 'to provide information') have higher results. However, for one of the higher-frequency categories, 'to ridicule', the F1 score is relatively low. This category incorporates irony, sarcasm, and humour, concepts that are hard to recognize, especially in the very restricted context of 140 characters. The four low-frequency categories ('to agree', 'to point out mistake or blunder', 'to disagree', 'none of the above') did not have enough training data for the classifier to build adequate models. The categories within 'oppose' are more difficult to distinguish among than the categories within 'favour'. However, for the most part this can be explained by the larger number of categories (6 in 'oppose' vs. 3 in 'favour') and, consequently, the smaller sizes of the individual categories.

Table 13: Per-category precision (P), recall (R), and F1 score of the classification on the 11-category problem. '#' is the number of tweets in the category. Micro-averaged P, R, and F1 are equal to accuracy since the categories are mutually exclusive.

  Category                        #       P       R      F1
  favour
    to agree                      5       0       0       0
    to praise                   161   57.59   50.43   53.77
    to support                  284   49.35   69.47   57.71
  oppose
    to point out hypocrisy       75   30.81   21.2    25.12
    to point out mistake         37       0       0       0
    to disagree                  27       0       0       0
    to ridicule                 165   31.56   43.76   36.67
    to criticize                 76   22.87    9.87   13.79
    to vent                      88   36.06   23.07   28.14
  other
    to provide information      143   45.14   50.63   47.73
    none of the above            11       0       0       0
  micro-average                       43.56   43.56   43.56

In the next set of experiments, we investigated the usefulness of each feature group for the task. We repeated the above classification process, each time removing one of the feature groups from the tweet representation. Table 14 shows the results of these ablation experiments for the 11-category and 3-category problems. In both cases, the most influential features were found to be n-grams, emotion lexicon features, part-of-speech tags, and word clusters.

Table 14: Accuracy of classification with one of the feature groups removed. Numbers in bold represent a statistically significant difference from the accuracy of the 'all features' classifier (first line) with 95% confidence. [Values for the middle rows are not recoverable.]

  Experiment                                    11-class   3-class
  all features                                    43.56     73.91
  all - n-grams
  all - NRC emotion lexicon
  all - parts of speech
  all - word clusters
  all - negation
  all - (all-caps, punctuation, emoticons,
         hashtags)                                43.38     73.87
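As promised above, here is a sketch of the negated-context marking used by the negation feature. The abbreviated negation-word pattern stands in for Christopher Potts' full list, and the '_NEG' suffix spelling is our reconstruction of the marker lost in extraction.

```python
import re

# Abbreviated stand-in for Potts' full list of negation words.
NEGATION = re.compile(r"(?i)(?:no|not|none|never|cannot|\w+n't)")
CLAUSE_END = {",", ".", ":", ";", "!", "?"}

def mark_negated_contexts(tokens):
    """Append _NEG to every token after a negation word, up to (and
    excluding) the next clause-level punctuation mark."""
    out, negated = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False              # punctuation closes the context
        out.append(tok + "_NEG" if negated else tok)
        if NEGATION.fullmatch(tok):      # a negation word opens a context
            negated = True
    return out

print(mark_negated_contexts("this is not a perfect plan , sadly".split()))
# ['this', 'is', 'not', 'a_NEG', 'perfect_NEG', 'plan_NEG', ',', 'sadly']
```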
4.2 The Hashtag Emotion Lexicon

Since the emotion lexicon had a significant impact on the results, we further created a wide-coverage Twitter-specific lexical resource, following work by Mohammad [21], who showed that emotion-word hashtagged tweets are a good source of labeled data for automatic emotion processing. Those experiments were conducted using tweets pertaining to the six Ekman emotions, because labeled evaluation data exists for only those emotions. However, a significant advantage of using hashtagged tweets is that we can collect large amounts of labeled data for any emotion that is used as a hashtag by tweeters. Thus, we polled the Twitter API and collected a large corpus of tweets pertaining to a few hundred emotions.

We used a list of 585 emotion words compiled by Zeno G. Swijtink as the hashtagged query words. Note that we chose not to dwell on the question of whether each of the words in this set is truly an emotion or not. Our goal was to create and distribute a large set of affect-labeled data, and users are free to choose a subset of the data that is relevant to their application. We calculated the pointwise mutual information (PMI) between an emotional hashtag and a word appearing in tweets. The PMI represents the degree of correlation between the word and the emotion, with larger scores representing stronger correlations. Consequently, the (word, hashtag) pairs that had positive PMI were pulled together into a new word-emotion association resource that we call the Hashtag Emotion Lexicon. The lexicon contains around 10,000 words with associations to 585 emotion-word hashtags.

We used the Hashtag Lexicon for classification by creating a separate feature for each emotion-related hashtag, resulting in 585 emotion features. The values of these features were calculated as the sum of the PMI scores between the words in a tweet and the corresponding emotion-related hashtag. Table 15 shows the results of the automatic classification using the new lexical resource. The Hashtag Lexicon significantly improved the performance of the classifier on the 11-category task. Even better results were obtained when both lexicons were employed.

Table 15: Accuracy of classification using different lexicons on the 11-class problem. Numbers in bold represent a statistically significant difference from the accuracy of the classifier using the NRC Emotion Lexicon (first line) with 95% confidence. [The value for 'both lexicons' is not recoverable.]

  Lexicon                Accuracy
  NRC Emotion Lexicon      43.56
  Hashtag Lexicon          44.35
  both lexicons
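A sketch of the lexicon construction and its use as features follows. The exact counting scheme (per-tweet word presence, the hashtag taken as the tweet's label) is our assumption; PMI is estimated as log p(word, emotion) / (p(word) p(emotion)) from those counts.

```python
import math
from collections import Counter

def build_hashtag_emotion_lexicon(tweets):
    """tweets: list of (tokens, emotion_hashtag) pairs, where the hashtag
    (e.g. '#anger') labels the tweet. Keeps only positive-PMI pairs."""
    word_n, emo_n, pair_n, total = Counter(), Counter(), Counter(), 0
    for tokens, emotion in tweets:
        total += 1
        emo_n[emotion] += 1
        for word in set(tokens):          # per-tweet presence, not frequency
            word_n[word] += 1
            pair_n[(word, emotion)] += 1
    lexicon = {}
    for (word, emotion), n in pair_n.items():
        # PMI = log [ p(word, emotion) / (p(word) * p(emotion)) ]
        pmi = math.log(n * total / (word_n[word] * emo_n[emotion]))
        if pmi > 0:
            lexicon[(word, emotion)] = pmi
    return lexicon

def hashtag_emotion_features(tokens, lexicon, emotions):
    """One feature per emotion hashtag: the sum of PMI scores between the
    tweet's words and that hashtag, as described above."""
    return [sum(lexicon.get((w, e), 0.0) for w in tokens) for e in emotions]
```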
5. CONCLUSIONS
Tweets are playing a growing role in the public discourse on politics. In this paper, we explored the purpose behind such tweets. Detecting purpose has a number of applications, including detecting the mood of the electorate, estimating the popularity of policies, identifying key issues of contention, and predicting the course of events. We compiled a dataset of 1 million tweets pertaining to the 2012 US presidential elections using relevant hashtags. We designed an online questionnaire and annotated a few thousand tweets for purpose via crowdsourcing. We analyzed these tweets and showed that a large majority convey an emotional attitude towards someone or something. Further, the number of messages posted to oppose someone or something was almost twice the number of messages posted to offer support.

We developed a classifier to automatically classify electoral tweets as per their purpose. It obtained an accuracy of 43.56% on an 11-class task and an accuracy of 73.91% on a 3-class task (both accuracies well above the most-frequent-class baseline). We found that resources developed for emotion detection, such as the NRC word-emotion association lexicon, are also helpful for detecting purpose. However, we also showed that emotion detection alone can fail to distinguish between several kinds of purpose. We make all the data created as part of this research freely available.

Using the Hashtag Lexicon pertaining to hundreds of emotions on the 3-category task did not show any improvement. This is probably because, there, the information about positive and negative sentiment provides the most gain.

In this paper, we relied only on the target tweet as context. However, it might be possible to obtain even better results by modeling user behaviour based on multiple past tweets. We are also interested in using purpose-annotated tweets as input to a system that automatically summarizes political tweets. Finally, we hope that a better understanding of the purpose of tweets will help drive the political discourse towards the issues and concerns most relevant to the people.
6. REFERENCES

[1] A. C. Alhadi, S. Staab, and T. Gottron. Exploring User Purpose Writing Single Tweets. In WebSci'11: Proceedings of the 3rd International Conference on Web Science, 2011.
[2] D. G. Avello. "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper": A Balanced Survey on Election Prediction using Twitter Data. arXiv, 1204.6441, 2012.
[3] L. Barbosa and J. Feng. Robust Sentiment Detection on Twitter from Biased and Noisy Data. In Proceedings of COLING: Poster Volume, pages 36-44, Beijing, China, August 2010.
[4] A. Bermingham and A. Smeaton. On Using Twitter to Monitor Political Sentiment and Predict Election Results. In Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011), pages 2-10, Chiang Mai, Thailand, 2011. Asian Federation of Natural Language Processing.
[5] C. Caragea, M. McNeese, A. Jaiswal, G. Traylor, H. Kim, P. Mitra, D. Wu, A. Tapia, C. Giles, J. Jansen, and J. Yen. Classifying Text Messages for the Haiti Earthquake. In Proceedings of the 8th International Conference on Information Systems for Crisis Response and Management (ISCRAM), Lisbon, Portugal, 2011.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
[7] C. M. Cheung and M. K. Lee. Understanding the Sustainability of a Virtual Community: Model Development and Empirical Test. Journal of Information Science, 35(3):279-298, 2009.
[8] J. E. Chung and E. Mustafaraj. Can Collective Sentiment Expressed on Twitter Predict Political Elections? In W. Burgard and D. Roth, editors, Proceedings of the 25th AAAI Conference on Artificial Intelligence, California, USA, 2011. AAAI Press.
[9] N. Collier, N. Son, and N. Nguyen. OMG U got flu? Analysis of Shared Health Messages for Bio-surveillance. Journal of Biomedical Semantics, 2(Suppl 5):S9, 2011.
[10] M. D. Conover, B. Goncalves, J. Ratkiewicz, A. Flammini, and F. Menczer. Predicting the Political Alignment of Twitter Users. In IEEE Third International Conference on Privacy, Security, Risk and Trust and IEEE Third International Conference on Social Computing, pages 192-199. IEEE, 2011.
[11] M. D. Conover, J. Ratkiewicz, M. Francisco, B. Goncalves, A. Flammini, and F. Menczer. Political Polarization on Twitter. Networks, 133(26):89-96, 2011.
[12] P. Ekman. An Argument for Basic Emotions. Cognition and Emotion, 6(3):169-200, 1992.
[13] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2011.
[14] J. Golbeck and D. Hansen. Computing Political Preference Among Twitter Followers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, pages 1105-1108, New York, NY, 2011. ACM.
[15] L. Jiang, M. Yu, M. Zhou, X. Liu, and T. Zhao. Target-Dependent Twitter Sentiment Classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011), pages 151-160, 2011.
[16] S. Kim, J. Bak, and A. H. Oh. Do You Feel What I Feel? Social Aspects of Emotions in Twitter Conversations. In Proceedings of the International AAAI Conference on Weblogs and Social Media, 2012.
[17] E. Kouloumpis, T. Wilson, and J. Moore. Twitter Sentiment Analysis: The Good the Bad and the OMG! In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[18] K. Lee, D. Palsetia, R. Narayanan, M. M. A. Patwary, A. Agrawal, and A. Choudhary. Twitter Trending Topic Classification. In Proceedings of the IEEE 11th International Conference on Data Mining Workshops (ICDMW), pages 251-258. IEEE, 2011.
[19] I. L. B. Liu, C. M. K. Cheung, and M. K. O. Lee. Understanding Twitter Usage: What Drive People Continue to Tweet, pages 928-939. 2010.
[20] D. Maynard and A. Funk. Automatic Detection of Political Opinions in Tweets. In The Semantic Web: ESWC 2011 Workshops, pages 88-99. Springer, 2011.
[21] S. Mohammad. #Emotional Tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pages 246-255, Montréal, Canada, 2012. Association for Computational Linguistics.
[22] S. M. Mohammad and P. D. Turney. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, LA, California, 2010.
[23] M. Naaman, J. Boase, and C.-H. Lai. Is It Really About Me?: Message Content in Social Awareness Streams. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work, CSCW '10, pages 189-192, New York, NY, 2010. ACM.
[24] K. Nishida, R. Banno, K. Fujimura, and T. Hoshide. Tweet Classification by Data Compression. In Proceedings of the International Workshop on Detecting and Exploiting Cultural Diversity on the Social Web, pages 29-34. ACM, 2011.
[25] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. In Proceedings of the International AAAI Conference on Weblogs and Social Media, 2010.
[26] A. Pak and P. Paroubek. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In Proceedings of LREC, 2010.
[27] B. Pang and L. Lee. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.
[28] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: Sentiment Classification Using Machine Learning Techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 79-86, Philadelphia, PA, 2002.
[29] L. Pearl and M. Steyvers. Identifying Emotions, Intentions, and Attitudes in Text Using a Game with a Purpose. In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Los Angeles, California, 2010.
[30] R. Plutchik. A General Psychoevolutionary Theory of Emotion. Emotion: Theory, Research, and Experience, 1(3):3-33, 1980.
[31] D. Quercia, J. Ellis, L. Capra, and J. Crowcroft. Tracking "Gross Community Happiness" from Tweets. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, CSCW '12, pages 965-968, New York, NY, 2012. ACM.
[32] D. Rao and D. Yarowsky. Detecting Latent User Properties in Social Media. In NIPS Workshop on Machine Learning for Social Networks (MLSC), 2010.
[33] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851-860. ACM, 2010.
[34] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. TwitterStand: News in Tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '09, pages 42-51, New York, NY, 2009. ACM.
[35] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short Text Classification in Twitter to Improve Information Filtering. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 841-842, New York, NY, 2010. ACM.
[36] C. Strapparava and A. Valitutti. WordNet-Affect: An Affective Extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC-2004), pages 1083-1086, Lisbon, Portugal, 2004.
[37] K. Tsagkalidou, V. Koutsonikola, A. Vakali, and K. Kafetsios. Emotional Aware Clustering on Micro-blogging Sources. In Proceedings of the 4th International Conference on Affective Computing and Intelligent Interaction, pages 387-396, Memphis, TN, 2011.
[38] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Election Forecasts With Twitter: How 140 Characters Reflect the Political Landscape.