How to evaluate sentiment classifiers for Twitter time-ordered data?

Igor Mozetič, Luis Torgo, Vitor Cerqueira, Jasmina Smailović

Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
INESC TEC, Porto, Portugal
Faculty of Sciences, University of Porto, Porto, Portugal
* [email protected]
Abstract
Social media are becoming an increasingly important source of information about the public mood regarding issues such as elections, Brexit, stock market, etc. In this paper we focus on sentiment classification of Twitter data. Construction of sentiment classifiers is a standard text mining task, but here we address the question of how to properly evaluate them, as there is no settled way to do so. Sentiment classes are ordered and unbalanced, and Twitter produces a stream of time-ordered data. The problem we address concerns the procedures used to obtain reliable estimates of performance measures, and whether the temporal ordering of the training and test data matters. We collected a large set of 1.5 million tweets in 13 European languages. We created 138 sentiment models and out-of-sample datasets, which are used as a gold standard for evaluations. The corresponding 138 in-sample datasets are used to empirically compare six different estimation procedures: three variants of cross-validation, and three variants of sequential validation (where the test set always follows the training set). We find no significant difference between the best cross-validation and sequential validation. However, we observe that all cross-validation variants tend to overestimate the performance, while the sequential methods tend to underestimate it. Standard cross-validation with random selection of examples is significantly worse than the blocked cross-validation, and should not be used to evaluate classifiers in time-ordered data scenarios.
Introduction

Online social media are becoming increasingly important in our society. Platforms such as Twitter and Facebook influence the daily lives of people around the world. Their users create and exchange a wide variety of contents on social media, which presents a valuable source of information about public sentiment regarding social, economic or political issues. In this context, it is important to develop automatic methods to retrieve and analyze information from social media.

In this paper we address the task of sentiment analysis of Twitter data. The task encompasses identification and categorization of opinions (e.g., negative, neutral, or positive) written in the quasi-natural language used in Twitter posts. We focus on estimation procedures of the predictive performance of machine learning models used to address this task. Performance estimation procedures are key to understand the generalization ability of the models, since they present approximations of how these models will behave on unseen data. In the particular case of sentiment analysis of Twitter data, high volumes of content are continuously being generated and there is no immediate feedback about the true class of instances. In this context, it is fundamental to adopt appropriate estimation procedures in order to get reliable estimates about the performance of the models.

The complexity of Twitter data raises some challenges on how to perform such estimations, as, to the best of our knowledge, there is currently no settled approach to this. Sentiment classes are typically ordered and unbalanced, and the data itself is time-ordered. Taking these properties into account is important for the selection of appropriate estimation procedures.

Twitter data shares some characteristics of time series and some of static data. A time series is an array of observations at regular or equidistant time points, and the observations are in general dependent on previous observations [1]. On the other hand, Twitter data is time-ordered, but the observations are short texts posted by Twitter users at any time and frequency. It can be assumed that original Twitter posts are not directly dependent on previous posts. However, there is a potential indirect dependence, demonstrated in important trends and events, through influential users and communities, or individual users' habits. These long-term topic drifts are typically not taken into account by the sentiment analysis models.

We study different performance estimation procedures for sentiment analysis of Twitter data. These estimation procedures are based on (i) cross-validation and (ii) sequential approaches typically adopted for time series data. On one hand, cross-validations explore all the available data, which is important for the robustness of estimates. On the other hand, sequential approaches are more realistic in the sense that estimates are computed on a subset of data always subsequent to the data used for training, which means that they take time-order into account.

Our experimental study is performed on a large collection of nearly 1.5 million Twitter posts, which are domain-free and in 13 different languages. A realistic scenario is emulated by partitioning the data into 138 datasets by language and time window. Each dataset is split into an in-sample (a training plus test set), where estimation procedures are applied to approximate the performance of a model, and an out-of-sample used to compute the gold standard.
Our goal is to understand the ability of each estimation procedure to approximate the true error incurred by a given model on the out-of-sample data.

The paper is structured as follows. Related work provides an overview of the state-of-the-art in estimation methods. In section Methods and experiments we describe the experimental setting for an empirical comparison of estimation procedures for sentiment classification of time-ordered Twitter data. We describe the Twitter sentiment datasets, the machine learning algorithm we employ, the performance measures, and how the gold standard and estimation results are produced. In section Results and discussion we present and discuss the results of comparisons of the estimation procedures along several dimensions. Conclusions provide the limitations of our work and give directions for the future.

Related work

In this section we briefly review typical estimation methods used in sentiment classification of Twitter data. In general, for time-ordered data, the estimation methods used are variants of cross-validation, or are derived from the methods used to analyze time series data. We examine the state-of-the-art of these estimation methods, pointing out their advantages and drawbacks.

Several works in the literature on sentiment classification of Twitter data employ standard cross-validation procedures to estimate the performance of sentiment classifiers. For example, Agarwal et al. [2] and Mohammad et al. [3] propose different methods for sentiment analysis of Twitter data and estimate their performance using 5-fold and 10-fold cross-validation, respectively. Bermingham and Smeaton [4] produce a comparative study of sentiment analysis between blogs and Twitter posts, where models are compared using 10-fold cross-validation. Saif et al. [5] assess binary classification performance on nine Twitter sentiment datasets by 10-fold cross-validation. Other, similar applications of cross-validation are given in [6, 7].

On the other hand, there are also approaches that use methods typical for time series data. For example, Bifet and Frank [8] use the prequential (predictive sequential) method to evaluate a sentiment classifier on a stream of Twitter posts. Moniz et al. [9] present a method for predicting the popularity of news from Twitter data and sentiment scores, and estimate its performance using a sequential approach in multiple testing periods.

The idea behind K-fold cross-validation is to randomly shuffle the data and split it into K equally-sized folds. Each fold is a subset of the data randomly picked for testing. Models are trained on the remaining K − 1 folds and tested on the held-out fold, so that each fold is used for testing exactly once. K-fold cross-validation has several practical advantages, such as an efficient use of all the data. However, it is also based on the assumption that the data is independent and identically distributed [10], which is often not true. For example, in time-ordered data, such as Twitter posts, the data are to some extent dependent due to the underlying temporal order of tweets. Therefore, using K-fold cross-validation means that one uses future information to predict past events, which might hinder the generalization ability of models.

There are several methods in the literature designed to cope with dependence between observations. The most common are sequential approaches typically used in time series forecasting tasks. Some variants of K-fold cross-validation which relax the independence assumption were also proposed. For time-ordered data, an estimation procedure is sequential when testing is always performed on the data subsequent to the training set.
Typically, the data is split into two parts, where the first is used to train the model and the second is held out for testing. These approaches are also known in the literature as the out-of-sample methods [11, 12].

Within sequential estimation methods one can adopt different strategies regarding train/test splitting, growing or sliding window setting, and eventual update of the models. In order to produce reliable estimates and test for robustness, Tashman [11] recommends employing these strategies in multiple testing periods. One should either create groups of data series according to, for example, different business cycles [13], or adopt a randomized approach, such as in [14]. A more complete overview of these approaches is given by Tashman [11].

In stream mining, where a model is continuously updated, the most commonly used estimation methods are holdout and prequential [15, 16]. The prequential strategy uses an incoming observation to first test the model and then to train it.

Besides sequential estimation methods, some variants of K-fold cross-validation were proposed in the literature that are specially designed to cope with dependency in the data and enable the application of cross-validation to time-ordered data. For example, blocked cross-validation (the name is adopted from Bergmeir [12]) was proposed by Snijders [17]. The method derives from standard K-fold cross-validation, but there is no initial random shuffling of observations. This renders K blocks of contiguous observations.

The problem of data dependency for cross-validation is addressed by McQuarrie and Tsai [18]. Their modified cross-validation removes observations from the training set that are dependent with the test observations. The main limitation of this method is its inefficient use of the available data, since many observations are removed, as pointed out in [19]. The method is also known as non-dependent cross-validation [12].

The applicability of variants of cross-validation methods to time series data, and their advantages over traditional sequential validations, are corroborated by Bergmeir et al. [12, 20, 21]. The authors conclude that in time series forecasting tasks, the blocked cross-validations yield better error estimates because of their more efficient use of the available data. Cerqueira et al. [22] compare performance estimation of various cross-validation and out-of-sample approaches on real-world and synthetic time series data. The results indicate that cross-validation is appropriate for the stationary synthetic time series data, while the out-of-sample approaches yield better estimates for real-world data.

Our contribution to the state-of-the-art is a large-scale empirical comparison of several estimation procedures on Twitter sentiment data. We focus on the differences between the cross-validation and sequential validation methods, to see how important the violation of data independence is in the case of Twitter posts. We consider longer-term time-dependence between the training and test sets, and completely ignore finer-scale dependence at the level of individual tweets (e.g., retweets and replies). To the best of our knowledge, there is no settled approach yet regarding proper validation of models for Twitter time-ordered data. This work provides some results which contribute to bridging that gap.

Methods and experiments

The goal of this study is to recommend appropriate estimation procedures for sentiment classification of Twitter time-ordered data. We assume a static sentiment classification model applied to a stream of Twitter posts.
In a real-case scenario, the model is trained on historical, labeled tweets, and applied to the current, incoming tweets. We emulate this scenario by exploring a large collection of nearly 1.5 million manually labeled tweets in 13 European languages (see subsection Data and models). Each language dataset is split into pairs of the in-sample data, on which a model is trained, and the out-of-sample data, on which the model is validated. The performance of the model on the out-of-sample data gives an estimate of its performance on the future, unseen data. Therefore, we first compute a set of 138 out-of-sample performance results, to be used as a gold standard (subsection Gold standard). In effect, our goal is to find the estimation procedure that best approximates this out-of-sample performance.

Throughout our experiments we use only one training algorithm (subsection Data and models), and two performance measures (subsection Performance measures). During training, the performance of the trained model can be estimated only on the in-sample data. However, there are different estimation procedures which yield these approximations. In machine learning, a standard procedure is cross-validation, while for time-ordered data, sequential validation is typically used. In this study, we compare three variants of cross-validation and three variants of sequential validation (subsection Estimation procedures). The goal is to find the in-sample estimation procedure that best approximates the out-of-sample gold standard. The error an estimation procedure makes is defined as the difference to the gold standard.
Data and models

We collected a large corpus of nearly 1.5 million Twitter posts written in 13 European languages. This is, to the best of our knowledge, by far the largest set of sentiment-labeled tweets publicly available. We engaged native speakers to label the tweets based on the sentiment expressed in them. The sentiment label has three possible values: negative, neutral or positive. It turned out that the human annotators perceived the values as ordered. The quality of annotations varies though, and is estimated from the self- and inter-annotator agreements. All the details about the datasets, the annotator agreements, and the ordering of sentiment values are in our previous study [23]. The sentiment distribution and quality of the individual language datasets are in Table 1. The tweets in the datasets are ordered by tweet IDs, which corresponds to ordering by the time of posting.
Table 1. Sentiment label distribution of Twitter datasets in 13 languages. The last column is a qualitative assessment of the annotation quality, based on the levels of the self- and inter-annotator agreement.

Language          Negative    Neutral   Positive      Total   Quality
Albanian    alb      7,062     15,066     23,630     45,758   poor
Bulgarian   bul     14,374     28,961     19,932     63,267   fair
English     eng     23,250     38,457     25,721     87,428   v.good
German      ger     19,039     52,166     26,743     97,948   fair
Hungarian   hun      9,062     17,833     30,410     57,305   good
Polish      pol     59,027     48,658     84,245    191,930   good
Portuguese  por     56,008     53,026     43,009    152,043   fair
Russian     rus     30,249     37,401     25,671     93,321   good
Ser/Cro/Bos scb     58,796     61,265     73,766    193,827   fair
Slovak      slk     15,060     13,112     30,598     58,770   good
Slovenian   slv     34,164     48,458     30,210    112,832   good
Spanish     spa     27,675     88,481    117,048    233,204   poor
Swedish     swe     22,381     15,387     13,630     51,398   good
Total              376,147    518,271    544,613  1,439,031

There are many supervised machine learning algorithms suitable for training sentiment classification models from labeled tweets. In this study we use a variant of the Support Vector Machine (SVM) [24]. The basic SVM is a two-class, binary classifier. In the training phase, SVM constructs a hyperplane in a high-dimensional vector space that separates one class from the other. In the classification phase, the side of the hyperplane determines the class. A two-class SVM can be extended into a multi-class classifier which takes the ordering of sentiment values into account, and implements ordinal classification [25]. Such an extension consists of two SVM classifiers: one classifier is trained to separate the negative examples from the neutral-or-positives; the other separates the negative-or-neutrals from the positives. The result is a classifier with two hyperplanes, which partitions the vector space into three subspaces: negative, neutral, and positive. During classification, the distances from both hyperplanes determine the predicted class. A further refinement is the TwoPlaneSVMbin classifier. It partitions the space around both hyperplanes into bins, and computes the distribution of the training examples in individual bins. During classification, the distances from both hyperplanes determine the appropriate bin, but the class is determined as the majority class in the bin.

The vector space is defined by the features extracted from the Twitter posts. The posts are first pre-processed by standard text processing methods, i.e., tokenization, stemming/lemmatization (if available for a specific language), unigram and bigram construction, and elimination of terms that do not appear at least 5 times in a dataset. Twitter-specific pre-processing is then applied, i.e., replacing URLs, Twitter usernames and hashtags with common tokens, adding emoticon features for different types of emoticons in tweets, handling of repetitive letters, etc. The feature vectors are then constructed by the Delta TF-IDF weighting scheme [26].

In our previous study [23] we compared five variants of the SVM classifiers and Naive Bayes on the Twitter sentiment classification task. TwoPlaneSVMbin was always among the top, but statistically indistinguishable, best-performing classifiers. It turned out that monitoring the quality of the annotation process has a much larger impact on the performance than the type of classifier used. In this study we fix the classifier, and use TwoPlaneSVMbin in all the experiments.
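The two-plane construction can be made concrete with a short sketch. The following is our own minimal illustration in Python with scikit-learn, not the LATINO implementation used in the paper; the class name and the use of LinearSVC are assumptions, and the bin-based refinement of TwoPlaneSVMbin is only indicated in a comment.

```python
import numpy as np
from sklearn.svm import LinearSVC

class TwoPlaneSVM:
    """Ordinal three-class sentiment classifier (sketch).

    Trains two binary SVMs: one separates negative from neutral-or-positive,
    the other separates negative-or-neutral from positive. Labels are
    -1 (negative), 0 (neutral), +1 (positive).
    """

    def fit(self, X, y):
        y = np.asarray(y)
        # Hyperplane 1: negative vs. neutral-or-positive.
        self.low_ = LinearSVC().fit(X, (y > -1).astype(int))
        # Hyperplane 2: negative-or-neutral vs. positive.
        self.high_ = LinearSVC().fit(X, (y > 0).astype(int))
        return self

    def predict(self, X):
        # Signed distances to the two hyperplanes.
        d_low = self.low_.decision_function(X)
        d_high = self.high_.decision_function(X)
        # The two hyperplanes partition the space into three subspaces.
        pred = np.zeros(len(d_low), dtype=int)     # default: neutral (0)
        pred[(d_low <= 0) & (d_high <= 0)] = -1    # negative side of both
        pred[(d_low > 0) & (d_high > 0)] = 1       # positive side of both
        # TwoPlaneSVMbin would instead discretize the (d_low, d_high) plane
        # into bins and return the majority training class of the bin.
        return pred
```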
Performance measures

Sentiment values are ordered, and the distribution of tweets between the three sentiment classes is often unbalanced. In such cases, accuracy is not the most appropriate performance measure [8, 23]. In this context, we evaluate performance with the following two metrics: Krippendorff's Alpha [27], and F̄1 [28].

Alpha was developed to measure the agreement between human annotators, but can also be used to measure the agreement between classification models and a gold standard. It generalizes several specialized agreement measures, takes ordering of classes into account, and accounts for the agreement by chance. Alpha is defined as follows:

$$\mathrm{Alpha} = 1 - \frac{D_o}{D_e} \qquad (1)$$

where $D_o$ is the observed disagreement between models, and $D_e$ is the disagreement expected by chance. When models agree perfectly, $\mathrm{Alpha} = 1$, and when the level of agreement equals the agreement by chance, $\mathrm{Alpha} = 0$. Note that Alpha can also be negative. The two disagreement measures are defined as:

$$D_o = \frac{1}{N} \sum_{c,c'} N(c,c') \cdot \delta(c,c') \qquad (2)$$

$$D_e = \frac{1}{N(N-1)} \sum_{c,c'} N(c) \cdot N(c') \cdot \delta(c,c') \qquad (3)$$

The arguments $N$, $N(c,c')$, $N(c)$, and $N(c')$ refer to the frequencies in a coincidence matrix, defined below. $c$ (and $c'$) is a discrete sentiment variable with three possible values: negative ($-1$), neutral ($0$), or positive ($+1$). $\delta(c,c')$ is a difference function between the values of $c$ and $c'$, for ordered variables defined as:

$$\delta(c,c') = (c - c')^2, \qquad c, c' \in \{-1, 0, +1\} \qquad (4)$$

Note that the disagreements $D_o$ and $D_e$ between the extreme classes (negative and positive) are four times larger than between the neighbouring classes.

A coincidence matrix tabulates all pairable values of $c$ from two models. In our case, we have a 3-by-3 coincidence matrix, and compare a model to the gold standard. The coincidence matrix is then the sum of the confusion matrix and its transpose. Each labeled tweet is entered twice, once as a $(c, c')$ pair, and once as a $(c', c)$ pair. $N(c,c')$ is the number of tweets labeled by the values $c$ and $c'$ by different models, $N(c)$ and $N(c')$ are the totals for each value, and $N$ is the grand total.

F̄1 is an instance of the $F$ score, a well-known performance measure in information retrieval [29] and machine learning. We use an instance specifically designed to evaluate 3-class sentiment models [28]. F̄1 is defined as follows:

$$\overline{F}_1 = \frac{F_1(-1) + F_1(+1)}{2} \qquad (5)$$

F̄1 implicitly takes the ordering of sentiment values into account by considering only the extreme labels, negative ($-1$) and positive ($+1$). The middle, neutral, is taken into account only indirectly. $F_1(c)$ is the harmonic mean of precision and recall for class $c$, $c \in \{-1, +1\}$. F̄1 = 1 implies that all negative and positive tweets were correctly classified and, as a consequence, all neutrals as well. F̄1 = 0 indicates that all negative and positive tweets were incorrectly classified. F̄1 does not account for correct classification by chance.
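To make the two measures concrete, here is a minimal sketch that computes Alpha, with the coincidence matrix built as the confusion matrix plus its transpose following Eqs. (1)-(4), and F̄1 following Eq. (5). This is our own illustration, not code from the paper.

```python
import numpy as np
from sklearn.metrics import f1_score

VALS = [-1, 0, 1]   # negative, neutral, positive

def krippendorff_alpha(gold, pred):
    """Krippendorff's Alpha for ordered sentiment labels, Eqs. (1)-(4)."""
    conf = np.zeros((3, 3))
    for g, p in zip(gold, pred):
        conf[VALS.index(g), VALS.index(p)] += 1
    coin = conf + conf.T                     # coincidence matrix
    n_c = coin.sum(axis=1)                   # N(c), totals per value
    n = coin.sum()                           # N, grand total
    # delta(c, c') = (c - c')^2: extreme-class disagreements weigh 4x.
    delta = np.array([[(a - b) ** 2 for b in VALS] for a in VALS])
    d_o = (coin * delta).sum() / n                             # Eq. (2)
    d_e = (np.outer(n_c, n_c) * delta).sum() / (n * (n - 1))   # Eq. (3)
    return 1 - d_o / d_e                                       # Eq. (1)

def f1_bar(gold, pred):
    """Mean F1 over the extreme classes only, Eq. (5)."""
    return f1_score(gold, pred, labels=[-1, 1], average=None).mean()
```

For example, krippendorff_alpha([-1, 0, 1, 1], [-1, 1, 1, 0]) scores four predictions against their gold labels.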
Gold standard

We create the gold standard results by splitting the data into in-sample datasets (abbreviated as in-sets), and out-of-sample datasets (abbreviated as out-sets). The terminology of the in- and out-set is adopted from Bergmeir et al. [12]. Tweets are ordered by the time of posting. To emulate a realistic scenario, an out-set always follows the in-set. From each language dataset (Table 1) we create L in-sets of varying length in multiples of 10,000 consecutive tweets, where L = ⌊N/10,000⌋. The out-set is the subsequent 10,000 consecutive tweets, or the remainder at the end of each language dataset. This is illustrated in Figure 1.
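A sketch of this splitting scheme, assuming the tweets of one language are given as a single time-ordered list (our own reconstruction of the layout in Figure 1):

```python
def in_out_splits(tweets, chunk=10_000):
    """Yield (in-set, out-set) pairs for one time-ordered language dataset.

    The i-th in-set contains the first i * chunk tweets; its out-set is the
    next chunk of tweets, or the remainder at the end of the dataset.
    """
    n_insets = len(tweets) // chunk          # L = floor(N / 10,000)
    for i in range(1, n_insets + 1):
        in_set = tweets[: i * chunk]
        out_set = tweets[i * chunk : (i + 1) * chunk]
        yield in_set, out_set
```

For English (87,428 tweets), for example, this yields L = 8 in-sets of 10,000 to 80,000 tweets; the last in-set is paired with the remaining 7,428 tweets. Summed over the 13 languages, the scheme produces the 138 pairs used below.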
Fig 1. Creation of the estimation and gold standard data. Each labeled language dataset (Table 1) is partitioned into L in-sets and corresponding out-sets. The in-sets always start at the first tweet and are progressively longer, in multiples of 10,000 tweets. The corresponding out-set is the subsequent 10,000 consecutive tweets, or the remainder at the end of the language dataset.

The partitioning of the language datasets results in 138 in-sets and corresponding out-sets. For each in-set, we train a TwoPlaneSVMbin sentiment classification model, and measure its performance, in terms of Alpha and F̄1, on the corresponding out-set. The results are in Tables 2 and 3. Note that the performance measured by Alpha is considerably lower in comparison to F̄1, since the baseline for Alpha is classification by chance.

The 138 in-sets are used to train sentiment classification models and estimate their performance. The goal of this study is to analyze different estimation procedures in terms of how well they approximate the out-set gold standard results shown in Tables 2 and 3.
Table 2. Gold standard performance results as measured by Alpha. The baseline, Alpha = 0, indicates classification by chance. Each row corresponds to one in-set (progressively longer, see Fig 1); empty cells mean the language dataset has fewer in-sets.

in-set    alb    bul    eng    ger    hun    pol    por    rus    scb    slk    slv    spa    swe
     1  0.210  0.321  0.414  0.391  0.419  0.409  0.338  0.369  0.275  0.367  0.327  0.171  0.470
     2  0.102  0.324  0.433  0.420  0.453  0.432  0.336  0.420  0.393  0.411  0.380  0.222  0.463
     3  0.084  0.339  0.449  0.423  0.482  0.479  0.360  0.441  0.408  0.425  0.414  0.255  0.458
     4  0.106  0.363  0.474  0.416  0.460  0.499  0.428  0.435  0.457  0.438  0.439  0.269  0.473
     5         0.375  0.513  0.387  0.475  0.486  0.183  0.478  0.421  0.454  0.453  0.211  0.480
     6         0.397  0.513  0.403         0.487  0.176  0.452  0.327         0.478  0.227
     7                0.541  0.406         0.483  0.224  0.492  0.293         0.455  0.226
     8                0.526  0.354         0.512  0.333  0.474  0.341         0.418  0.227
     9                       0.351         0.467  0.388  0.489  0.358         0.425  0.151
    10                                     0.513  0.409         0.384         0.418  0.193
    11                                     0.491  0.425         0.382         0.320  0.196
    12                                     0.526  0.434         0.485                0.220
    13                                     0.549  0.439         0.528                0.233
    14                                     0.535  0.453         0.551                0.207
    15                                     0.541  0.472         0.512                0.202
    16                                     0.500                0.533                0.179
    17                                     0.544                0.418                0.159
    18                                     0.532                0.514                0.207
    19                                     0.528                0.479                0.216
    20                                                                               0.251
    21                                                                               0.241
    22                                                                               0.110
    23                                                                               0.142
Table 3. Gold standard performance results as measured by F̄1. The baseline, F̄1 = 0, indicates that all negative and positive examples are classified incorrectly. Each row corresponds to one in-set (progressively longer, see Fig 1); empty cells mean the language dataset has fewer in-sets.

in-set    alb    bul    eng    ger    hun    pol    por    rus    scb    slk    slv    spa    swe
     1  0.479  0.509  0.545  0.578  0.610  0.621  0.356  0.551  0.492  0.616  0.485  0.436  0.627
     2  0.396  0.501  0.567  0.595  0.624  0.632  0.358  0.560  0.569  0.657  0.533  0.452  0.620
     3  0.387  0.498  0.571  0.588  0.637  0.653  0.383  0.572  0.577  0.669  0.567  0.504  0.629
     4  0.388  0.510  0.595  0.561  0.628  0.670  0.449  0.571  0.626  0.670  0.593  0.473  0.630
     5         0.513  0.634  0.533  0.640  0.651  0.243  0.604  0.580  0.675  0.603  0.446  0.658
     6         0.535  0.640  0.537         0.663  0.252  0.588  0.485         0.624  0.454
     7                0.654  0.529         0.656  0.322  0.617  0.469         0.550  0.440
     8                0.647  0.409         0.682  0.448  0.610  0.493         0.521  0.438
     9                       0.413         0.654  0.529  0.614  0.503         0.524  0.429
    10                                     0.672  0.556         0.526         0.507  0.424
    11                                     0.659  0.589         0.573         0.415  0.412
    12                                     0.680  0.605         0.654                0.407
    13                                     0.696  0.608         0.686                0.431
    14                                     0.679  0.624         0.696                0.398
    15                                     0.682  0.638         0.665                0.403
    16                                     0.650                0.684                0.402
    17                                     0.670                0.644                0.390
    18                                     0.663                0.661                0.446
    19                                     0.663                0.625                0.479
    20                                                                               0.516
    21                                                                               0.516
    22                                                                               0.423
    23                                                                               0.449

Estimation procedures

There are different estimation procedures, some more suitable for static data, while others are more appropriate for time-series data. Time-ordered Twitter data shares some properties of both types of data. When training an SVM model, the order of tweets is irrelevant and the model does not capture the dynamics of the data. When applying the model, however, new tweets might introduce new vocabulary and topics. As a consequence, the temporal ordering of training and test data has a potential impact on the performance estimates.

We therefore compare two classes of estimation procedures: cross-validation, commonly used in machine learning for model evaluation on static data, and sequential validation, commonly used for time-series data. There are many variants and parameters for each class of procedures. Our datasets are relatively large and an application of each estimation procedure takes several days to complete. We have selected three variants of each procedure to provide answers to some relevant questions.

First, we apply 10-fold cross-validation where the training:test set ratio is always 9:1. Cross-validation is stratified when the fold partitioning is not completely random, but each fold has roughly the same class distribution. We also compare standard random selection of examples to the blocked form of cross-validation [12, 17], where each fold is a block of consecutive tweets. We use the following abbreviations for cross-validations:

• xval(9:1, strat, block) - 10-fold, stratified, blocked;
• xval(9:1, no-strat, block) - 10-fold, not stratified, blocked;
• xval(9:1, strat, rand) - 10-fold, stratified, random selection of examples.

In sequential validation, a sample consists of the training set immediately followed by the test set. We vary the ratio of the training and test set sizes, and the number and distribution of samples taken from the in-set. The number of samples is 10 or 20, and they are distributed equidistantly or semi-equidistantly. In all variants, samples cover the whole in-set, but they are overlapping. See Figure 2 for an illustration. We use the following abbreviations for sequential validations:

• seq(9:1, 20, equi) - 9:1 training:test ratio, 20 equidistant samples;
• seq(9:1, 10, equi) - 9:1 training:test ratio, 10 equidistant samples;
• seq(2:1, 10, semi-equi) - 2:1 training:test ratio, 10 samples randomly selected out of 20 equidistant points.

Fig 2. Sampling of an in-set for sequential validation. A sample consists of a training set, immediately followed by a test set.
We consider two scenarios: (A) The ratio of the training and test set is 9:1, and the sample is shifted along 10 or 20 equidistant points. (B) The training:test set ratio is 2:1 and the sample is positioned at 10 randomly selected points out of 20 equidistant points.
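The split logic behind the compared procedures reduces to index arithmetic over an in-set of n time-ordered tweets. The sketch below is our own reconstruction: stratification is omitted, and the exact sample length and spacing used in the paper may differ.

```python
import random

def blocked_cv_folds(n, k=10):
    """Blocked cross-validation: no shuffling; each fold is a block of
    consecutive tweets, tested after training on the other k-1 blocks."""
    edges = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        test = list(range(edges[i], edges[i + 1]))
        train = [j for j in range(n) if not edges[i] <= j < edges[i + 1]]
        yield train, test

def random_cv_folds(n, k=10, seed=0):
    """Standard cross-validation: random assignment of tweets to folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for i in range(k):
        test = idx[i::k]
        held_out = set(test)
        train = [j for j in idx if j not in held_out]
        yield train, test

def sequential_samples(n, w, n_samples=20, train_frac=0.9):
    """Sequential validation: each sample of length w is a training set
    immediately followed by a test set; sample starts are equidistant."""
    split = round(w * train_frac)
    for i in range(n_samples):
        start = round(i * (n - w) / (n_samples - 1))
        yield (list(range(start, start + split)),       # training set
               list(range(start + split, start + w)))   # test set
```

The seq(2:1, 10, semi-equi) variant corresponds to train_frac = 2/3 with 10 of the 20 equidistant start positions chosen at random.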
Results and discussion

We compare six estimation procedures in terms of the different types of errors they incur. The error is defined as the difference to the gold standard. First, the magnitude and sign of the errors show whether a method tends to underestimate or overestimate the performance, and by how much (subsection Median errors). Second, relative errors give the fractions of small, moderate, and large errors that each procedure incurs (subsection Relative errors). Third, we rank the estimation procedures in terms of increasing absolute errors, and estimate the significance of the overall ranking by the Friedman-Nemenyi test (subsection Friedman test). Finally, selected pairs of estimation procedures are compared by the Wilcoxon signed-rank test (subsection Wilcoxon test).
Median errors

An estimation procedure estimates the performance (abbreviated Est) of a model in terms of Alpha and F̄1. The error it incurs is defined as the difference to the gold standard performance (abbreviated Gold): Err = Est − Gold. The validation results show high variability of the errors, with skewed distributions and many outliers. Therefore, we summarize the errors in terms of their medians and quartiles, instead of the averages and variances. The median errors of the six estimation procedures are in Tables 4 and 5, measured by Alpha and F̄1, respectively.
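In code, this error computation and its median/quartile summary reduce to a short group-by (a sketch; the column names and toy values are our own, not from the paper's released R code):

```python
import pandas as pd

# One row per (in-set, procedure): estimated and gold standard Alpha.
results = pd.DataFrame({
    "procedure": ["xval(9:1, strat, rand)", "seq(9:1, 20, equi)"],
    "est":       [0.43, 0.40],    # toy values for illustration
    "gold":      [0.41, 0.41],
})
results["err"] = results["est"] - results["gold"]    # Err = Est - Gold
quartiles = results.groupby("procedure")["err"].quantile([0.25, 0.5, 0.75])
print(quartiles)    # median and quartiles per estimation procedure
```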
Table 4. Median errors, measured by Alpha, for individual language datasets and six estimation procedures: xval(9:1, strat, block), xval(9:1, no-strat, block), xval(9:1, strat, rand), seq(9:1, 20, equi), seq(9:1, 10, equi), and seq(2:1, 10, semi-equi).

Table 5. Median errors, measured by F̄1, for individual language datasets and six estimation procedures: xval(9:1, strat, block), xval(9:1, no-strat, block), xval(9:1, strat, rand), seq(9:1, 20, equi), seq(9:1, 10, equi), and seq(2:1, 10, semi-equi).

Figure 3 depicts the errors with box plots. The band inside the box denotes the median, the box spans the second and third quartile, and the whiskers denote the 1.5 interquartile range. The dots correspond to the outliers. Figure 3 shows high variability of errors for individual datasets. This is most pronounced for the Serbian/Croatian/Bosnian (scb) and Portuguese (por) datasets, where variation in annotation quality (scb) and a radical topic shift (por) were observed. Higher variability is also observed for the Spanish (spa) and Albanian (alb) datasets, which have poor sentiment annotation quality (see [23] for details).

Fig 3. Box plots of errors of six estimation procedures for 13 language datasets. Errors are measured in terms of Alpha.

The differences between the estimation procedures are easier to detect when we aggregate the errors over all language datasets. The results are in Figures 4 and 5, for Alpha and F̄1, respectively. In both cases we observe that the cross-validation procedures (xval) consistently overestimate the performance, while the sequential validations (seq) underestimate it. The largest overestimation errors are incurred by the random cross-validation, and the largest underestimations by the sequential validation with the training:test set ratio 2:1. We also observe high variability of errors, with many outliers. The conclusions are consistent for both measures, Alpha and F̄1.
Fig 4. Box plots of errors of six estimation procedures aggregated over all language datasets. Errors are measured in terms of Alpha.
Fig 5. Box plots of errors of six estimation procedures aggregated over all language datasets. Errors are measured in terms of F̄1.

Relative errors

Another useful analysis of estimation errors is provided by a comparison of relative errors. The relative error is the absolute error an estimation procedure incurs, divided by the gold standard result: RelErr = |Est − Gold| / Gold. We chose two, rather arbitrary, thresholds of 5% and 30%, and classify the relative errors as small (RelErr < 5%), moderate (5% ≤ RelErr ≤ 30%), or large (RelErr > 30%). Figure 6 shows the proportions of relative errors, measured by Alpha, for individual language datasets. Again, we observe a higher proportion of large errors for languages with poor annotations (alb, spa), annotations of different quality (scb), and different topics (por).
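The classification into error bands is then a simple thresholding rule (our own sketch):

```python
def relerr_band(est, gold):
    """Assign an estimate to one of the paper's three relative-error bands."""
    rel = abs(est - gold) / gold    # RelErr = |Est - Gold| / Gold
    if rel < 0.05:
        return "small"              # RelErr < 5%
    if rel <= 0.30:
        return "moderate"           # 5% <= RelErr <= 30%
    return "large"                  # RelErr > 30%
```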
Fig 6. Proportion of relative errors, measured by Alpha, per estimation procedure and individual language dataset. Small errors (RelErr < 5%), moderate errors (5% ≤ RelErr ≤ 30%), and large errors (RelErr > 30%).

The proportions of relative errors, aggregated over all 138 datasets, are in Figures 7 and 8, for Alpha and F̄1, respectively. The proportion of errors is consistent between Alpha and F̄1, but there are more large errors when the performance is measured by Alpha. This is due to the smaller error magnitude when the performance is measured by Alpha in contrast to F̄1, since Alpha takes classification by chance into account. With respect to individual estimation procedures, there is a considerable divergence of the random cross-validation. For both performance measures, Alpha and F̄1, it consistently incurs a higher proportion of large errors and a lower proportion of small errors in comparison to the rest of the estimation procedures.
Fig 7. Proportion of relative errors, measured by Alpha, per estimation procedure, aggregated over all 138 datasets. Small errors (RelErr < 5%), moderate errors (5% ≤ RelErr ≤ 30%), and large errors (RelErr > 30%).

Fig 8. Proportion of relative errors, measured by F̄1, per estimation procedure, aggregated over all 138 datasets. Small errors (RelErr < 5%), moderate errors (5% ≤ RelErr ≤ 30%), and large errors (RelErr > 30%).

Friedman test

The Friedman test is used to compare multiple procedures over multiple datasets [30–33]. For each dataset, it ranks the procedures by their performance. It tests the null hypothesis that the average ranks of the procedures across all the datasets are equal. If the null hypothesis is rejected, one applies the Nemenyi post-hoc test [34] on pairs of procedures. The performance of two procedures is significantly different if their average ranks differ by at least the critical difference. The critical difference depends on the number of procedures to compare, the number of different datasets, and the selected significance level.

In our case, the performance of an estimation procedure is taken as the absolute error it incurs:
AbsErr = |Est − Gold|. The estimation procedure with the lowest absolute error gets the lowest (best) rank. The results of the Friedman-Nemenyi test are in Figures 9 and 10, for Alpha and F̄1, respectively.

Fig 9. Ranking of the six estimation procedures according to the Friedman-Nemenyi test. The average ranks are computed from absolute errors, measured by Alpha. The black bars connect ranks that are not significantly different at the 5% level.
Fig 10. Ranking of the six estimation procedures according to the Friedman-Nemenyi test. The average ranks are computed from absolute errors, measured by F̄1. The black bar connects ranks that are not significantly different at the 5% level.

For both performance measures, Alpha and F̄1, the Friedman rankings are the same. For six estimation procedures, 13 language datasets, and the 5% significance level, the critical difference is 2.17. In the case of F̄1 (Figure 10), all six estimation procedures are within the critical difference, so their ranks are not significantly different. In the case of Alpha (Figure 9), however, the two best methods are significantly better than the random cross-validation.
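A sketch of these computations with scipy (our own illustration on random toy data; the q value below is a standard Studentized-range table entry and an assumption on our part, so the resulting critical difference need not match the paper's 2.17 exactly):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(0)
abs_err = rng.uniform(0.0, 0.1, size=(13, 6))   # toy AbsErr: datasets x procedures

# Friedman test: are the average ranks of the 6 procedures equal?
stat, p = friedmanchisquare(*abs_err.T)

# Average rank per procedure (lowest absolute error -> best rank 1).
avg_rank = rankdata(abs_err, axis=1).mean(axis=0)

# Nemenyi critical difference: CD = q * sqrt(k * (k + 1) / (6 * N)),
# with q taken from the Studentized range table (2.850 for k = 6, 5% level).
n_datasets, k = abs_err.shape
cd = 2.850 * np.sqrt(k * (k + 1) / (6 * n_datasets))
```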
Wilcoxon test

The Wilcoxon signed-rank test is used to compare two procedures on related data [33, 35]. It ranks the differences in performance of the two procedures, and compares the ranks for the positive and negative differences. Greater differences count more, but the absolute magnitudes are ignored. It tests the null hypothesis that the differences follow a symmetric distribution around zero. If the null hypothesis is rejected, one can conclude that one procedure outperforms the other at the selected significance level.

In our case, the performance of pairs of estimation procedures is compared at the level of language datasets. The absolute errors of an estimation procedure are averaged across the in-sets of a language. The average absolute error is then AvgAbsErr = Σ |Est − Gold| / L, where L is the number of in-sets. The results of the Wilcoxon test, for selected pairs of estimation procedures, for both Alpha and F̄1, are in Figure 11.
Fig 11. Differences between pairs of estimation procedures according to the Wilcoxon signed-rank test. Compared are the average absolute errors, measured by Alpha (top) and F̄1 (bottom). Thick solid lines denote significant differences at the 1% level, normal solid lines significant differences at the 5% level, and dashed lines insignificant differences. Arrows point from a procedure which incurs smaller errors to a procedure with larger errors.

The Wilcoxon test results confirm and reinforce the main results of the previous sections. Among the cross-validation procedures, blocked cross-validation is consistently better than the random cross-validation, at the 1% significance level. The stratified approach is better than the non-stratified, but significantly (at the 5% level) only for F̄1. The comparison of the sequential validation procedures is less conclusive. The training:test set ratio 9:1 is better than 2:1, but significantly (at the 5% level) only for Alpha. With the ratio 9:1 fixed, 20 samples yield better performance estimates than 10 samples, but significantly (at the 5% level) only for F̄1. We found no significant difference between the best cross-validation and sequential validation procedures in terms of how well they estimate the average absolute errors.
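A sketch of one such pairwise comparison with scipy (toy AvgAbsErr values, one per language; not the paper's actual numbers):

```python
import numpy as np
from scipy.stats import wilcoxon

# Average absolute errors per language for two procedures (13 toy values).
proc_a = np.array([0.03, 0.05, 0.02, 0.04, 0.06, 0.03, 0.05,
                   0.02, 0.04, 0.03, 0.05, 0.06, 0.02])
proc_b = np.array([0.05, 0.06, 0.03, 0.05, 0.08, 0.04, 0.06,
                   0.04, 0.05, 0.04, 0.07, 0.08, 0.03])

# H0: the paired differences are symmetric around zero.
stat, p = wilcoxon(proc_a, proc_b)
print(f"p = {p:.3f}")   # a small p means one procedure incurs smaller errors
```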
Conclusions

In this paper we present an extensive empirical study of performance estimation procedures for sentiment analysis of Twitter data. Currently, there is no settled approach on how to properly evaluate models in such a scenario. Twitter time-ordered data shares some properties of static data for text mining, and some of time series data. Therefore, we compare estimation procedures developed for both types of data.

The main result of the study is that standard, random cross-validation should not be used when dealing with time-ordered data. Instead, one should use blocked cross-validation, a conclusion already corroborated by Bergmeir et al. [12, 20]. Another result is that we find no significant differences between the blocked cross-validation and the best sequential validation. However, we do find that cross-validations typically overestimate the performance, while sequential validations underestimate it.

The results are robust in the sense that we use two different performance measures, several comparisons and tests, and a very large collection of data. To the best of our knowledge, we analyze and provide by far the largest set of manually sentiment-labeled tweets publicly available.

There are some biased decisions in our creation of the gold standard though, which limit the generality of the reported results and should be addressed in future work. An out-set always consists of 10,000 tweets, and immediately follows the in-set. We do not consider how the performance drops over longer out-sets, nor how frequently a model should be updated. More importantly, we intentionally ignore the issue of dependent observations, between the in- and out-sets, and between the training and test sets. In the case of tweets, short-term dependencies are demonstrated in the form of retweets and replies. Medium- and long-term dependencies are shaped by periodic events, influential users and communities, or individual users' habits. When this is ignored, the model performance is likely overestimated. Since we do this consistently, our comparative results still hold. The issue of dependent observations was already addressed for blocked cross-validation [21, 37] by removing adjacent observations between the training and test sets, thus effectively creating a gap between the two. Finally, it should be noted that the different Twitter language datasets are of different sizes and annotation quality, belong to different time periods, and that there are time periods in the datasets without any manually labeled tweets.
Data and code availability
All Twitter data were collected through the public Twitter API and are subject to the Twitter terms and conditions. The Twitter language datasets are available in the public language resource repository clarin.si at http://hdl.handle.net/11356/1054, and are described in [23]. There are 15 language files, where the Serbian/Croatian/Bosnian dataset is provided as three separate files for the constituent languages. For each language and each labeled tweet, there is the tweet ID (as provided by Twitter), the sentiment label (negative, neutral, or positive), and the annotator ID (anonymized). Note that the Twitter terms do not allow to openly publish the original tweets; they have to be fetched through the Twitter API. Precise details on how to fetch the tweets, given the tweet IDs, are provided in the Twitter API documentation at https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-lookup. However, upon request to the corresponding author, a bilateral agreement on the joint use of the original data can be reached.

The TwoPlaneSVMbin classifier and several other machine learning algorithms are implemented in the open source LATINO library [36]. LATINO is a light-weight set of software components for building text mining applications, openly available at https://github.com/latinolib.

All the performance results, for the gold standard and the six estimation procedures, are provided in a form which allows for easy reproduction of the presented results. The R code and data files needed to reproduce all the figures and tables in the paper are available at http://ltorgo.github.io/TwitterDS/.

Acknowledgements

Igor Mozetič and Jasmina Smailović acknowledge financial support from the H2020 FET project DOLFINS (grant no. 640772), and the Slovenian Research Agency (research core funding no. P2-0103). Luis Torgo and Vitor Cerqueira acknowledge financing by project "Coral - Sustainable Ocean Exploitation: Tools and Sensors/NORTE-01-0145-FEDER-000036", financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

We thank Miha Grčar and Sašo Rutar for valuable discussions and the implementation of the LATINO library.