If you've got it, flaunt it: Making the most of fine-grained sentiment annotations
Jeremy Barnes, Lilja Øvrelid, and Erik Velldal
University of Oslo, Department of Informatics
{jeremycb,liljao,erikve}@ifi.uio.no

Abstract
Fine-grained sentiment analysis attempts to extract sentiment holders, targets and polar expressions and resolve the relationship between them, but progress has been hampered by the difficulty of annotation. Targeted sentiment analysis, on the other hand, is a more narrow task, focusing on extracting sentiment targets and classifying their polarity. In this paper, we explore whether incorporating holder and expression information can improve target extraction and classification, and perform experiments on eight English datasets. We conclude that jointly predicting target and polarity BIO labels improves target extraction, and that augmenting the input text with gold expressions generally improves targeted polarity classification. This highlights the potential importance of annotating expressions for fine-grained sentiment datasets. At the same time, our results show that the performance of current models for predicting polar expressions is poor, hampering the benefit of this information in practice.
1 Introduction

Sentiment analysis comes in many flavors, arguably the most complete of which is what is often called fine-grained sentiment analysis (Wiebe et al., 2005; Liu, 2015). This approach models the sentiment task as minimally extracting all opinion holders, targets, and expressions in a text and resolving the relationships between them. This complex task is further complicated by interactions between these elements, strong domain effects, and the subjective nature of sentiment. Take the annotated sentence in Figure 1 as an example. Knowing that the target "UMUC" is modified by the expression "5 stars" and not "don't believe" is important to correctly classifying the polarity. Additionally, the fact that this is a belief held by "some others" as opposed to the author of the sentence can help us determine the overall polarity expressed in the sentence.

Figure 1: An opinion annotation from the Darmstadt Review Corpus: "...have seen some others giving UMUC [5 stars] - [don't believe] them", with holder "some others", target "UMUC", and polar expressions "5 stars" and "don't believe".

Compared to document- or sentence-level sentiment analysis, where distant labelling schemes can be used to obtain annotated data, fine-grained annotation of sentiment does not occur naturally, which means that current machine learning models are often hampered by the small size of datasets. Furthermore, fine-grained annotation is demanding, leads to relatively small datasets, and has low inter-annotator agreement (Wiebe et al., 2005; Wang et al., 2017). This begs the question: is it worth it to annotate full fine-grained sentiment?

Targeted sentiment (Mitchell et al., 2013; Zhang et al., 2015) is a reduction of the fine-grained sentiment task which concentrates on extracting sentiment targets and classifying their polarity, effectively ignoring sentiment holders and expressions. The benefit of this setup is that it is faster to annotate and simpler to model. But would targeted sentiment models benefit from knowing the sentiment holders and expressions?

In this work, we attempt to determine whether holder and expression information is useful for extracting and then classifying sentiment targets. Specifically, we ask the following research questions:
RQ1: Given the time and difficulty required to annotate opinion holders, expressions, and polarity, is this information useful to extract sentiment targets?
(a) Does augmenting the input text with holders and expressions improve target extraction?
(b) Do target extraction models benefit from predicting holders and expressions?
(c) Do target extraction models benefit from predicting the polarity of targets and/or expressions?
RQ2: Can holder and expression information improve polarity classification on extracted targets?
(a) Does augmenting the input text with holders and expressions improve polarity classification?
(b) Do potential benefits of augmenting the input depend on how we model the target, i.e., using the [CLS] embeddings, mean pooling the target embeddings, etc.?
(c) Can sentiment lexicons provide enough information on expressions to give improvements?

We conduct a series of experiments on eight English sentiment datasets (three with full fine-grained sentiment and five targeted) with state-of-the-art models based on fine-tuned BERT models. We show that (1) it is possible to improve target extraction by also trying to predict the polarity, and that (2) classification models benefit from having access to information about sentiment expressions. We also (3) release the code to reproduce the experiments, as well as the scripts to download, preprocess, and collect the datasets into a compatible JSON format, with the hope that this allows future research on the same data (https://github.com/ltgoslo/finegrained_modelling).

2 Related Work

Fine-grained approaches to sentiment analysis attempt to discover opinions from text, where each opinion is a tuple of (opinion holder, opinion target, opinion expression, polarity, intensity). Annotation of datasets at this granularity requires creating in-depth annotation guidelines and training annotators, and generally leads to lower inter-annotator scores than other sentiment tasks, e.g., document- or sentence-level classification, as deciding on the spans for multiple elements and their relationships is undeniably harder than choosing a single label for a full text. Targeted sentiment, on the other hand, generally concentrates only on target extraction and polarity classification. This simplified annotation can be crowd-sourced, allowing non-experts to annotate and making it easier to collect larger datasets for machine learning.
The Multi-Perspective Question Answering dataset (MPQA) (Wiebe et al., 2005) is the first dataset that annotated opinion holders, targets, expressions, and their relationships. The newswire data leads to complex opinions and a generally difficult task for sentiment models. Normally, the full opinion extraction task is modelled as extraction of the individual elements (holders, targets, and expressions) and the subsequent resolution of the relationships between them.
The Darmstadt Review Corpora (Toprak et al., 2010) contain annotated opinions for consumer reviews of universities and services. The authors annotate holders, targets, expressions, polarity, modifiers, and intensity. They achieve between 0.5 and 0.8 agreement using the agr method (Wiebe et al., 2005), with higher disagreement on what they call "polar targets" (targets that have a polarity but no annotated sentiment expression), holders, and expressions.

The Open Domain Targeted dataset (Mitchell et al., 2013) makes use of crowdsourcing to annotate named entities (NEs) from scraped tweets in English and Spanish (Etter et al., 2013) with their polarities. The authors use majority voting to assign the final labels for the NEs, discarding tweets without sentiment consensus on all NEs.

The 2014 SemEval shared task (Pontiki et al., 2014) on aspect-based sentiment analysis includes labeled data from restaurant and laptop reviews for two subtasks: 1) target extraction, which they call "aspect term extraction", and 2) classification of polarity with respect to targets ("aspect term polarity").

As most targeted datasets only contain a single target, or multiple targets with the same polarity, sentence-level classifiers are strong baselines. In order to mitigate this, Jiang et al. (2019) create a Challenge dataset which has both multiple targets and multiple polarities in each sentence. Similarly, Wang et al. (2017) also point out that most targeted sentiment methods perform poorly with multiple targets and propose TDParse, a corpus of UK election tweets with multiple targets per tweet.
Katiyar and Cardie (2016) explore jointly extracting holders, targets, and expressions with LSTMs. They find that adding sentence-level and relation-level dependencies (IS-FROM or IS-ABOUT) improves extraction, but that the LSTM models lag behind CRFs with rich features.

Regarding modelling the interaction between elements, there are several previous attempts to jointly learn to extract and classify targets, using factor graphs (Klinger and Cimiano, 2013), multi-task learning (He et al., 2019), or sequence tagging with collapsed tagsets representing both tasks (Li et al., 2019). In general, the benefits are small, suggesting that there is only a weak relationship between target extraction and polarity classification (Hu et al., 2019).
3 Datasets

One of the difficulties of working with fine-grained sentiment analysis is that there are only a few datasets (even in English) and they come in incompatible, competing data formats, e.g., BRAT or various flavors of XML. With the goal of creating a simple unified format for working on fine-grained sentiment tasks, we take the eight datasets mentioned in Section 2 – MPQA (Wiebe et al., 2005), Darmstadt Services and Universities (Toprak et al., 2010), TDParse (Wang et al., 2017), SemEval Restaurant and Laptop (Pontiki et al., 2014), Open Domain Targeted Sentiment (Mitchell et al., 2013), and the Challenge dataset from Jiang et al. (2019) – and convert them to a standard JSON format. The datasets are sentence and word tokenized using NLTK (Loper and Bird, 2002), except for MPQA, DS. Service and DS. Uni, which already contain sentence and token spans. All polarity annotations are mapped to positive, negative, neutral, and conflict (we discard conflict during evaluation because there are not enough examples to properly learn this class in most datasets). As such, each sentence contains a sentence id, the tokenized text, and a possibly empty set of opinions which contain a holder, target, expression, polarity, and intensity. We allow for empty holders and expressions in order to generalize to the targeted corpora. Finally, we use 10 percent of the training data as development and another 10 percent for test for the corpora that do not contain a suggested train/dev/test split. For training and testing models, however, we convert the datasets to CoNLL format.
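To make this format concrete, the sketch below shows what a single sentence could look like. The field names and the (start, end) token-span encoding are illustrative assumptions based on the description above; the released conversion scripts define the exact schema.

    # A hypothetical entry in the unified JSON format described above.
    # Field names and the inclusive (start, end) token spans are
    # assumptions for illustration, not the exact released schema.
    example_sentence = {
        "sent_id": "ds_unis-train-0042",              # hypothetical id
        "text": ["have", "seen", "some", "others", "giving",
                 "UMUC", "5", "stars"],               # tokenized text
        "opinions": [                                 # possibly empty
            {
                "holder": [[2, 3]],                   # "some others"
                "target": [[5, 5]],                   # "UMUC"
                "expression": [[6, 7]],               # "5 stars"
                "polarity": "positive",
                "intensity": "average",               # hypothetical value
            }
        ],
    }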
Table 1 presents an overview of the different datasets and highlights important differences between them. The fully fine-grained sentiment datasets (MPQA, DS. Services, and DS. Uni) tend to be larger but have fewer targets annotated, due to a larger number of sentences with no targets. However, the MPQA dataset contains much longer targets than the other datasets – an average of 6, but a maximum of 56 tokens. It also contains more opinion holders and expressions, and these also tend to be longer, all of which marks MPQA as an outlier among the datasets. The distribution of polarity is also highly dependent on the dataset, with DS. Services being the most skewed and SemEval Laptop the least skewed. Finally, the Challenge dataset is by far the largest with over 11,000 training targets. Additionally, Table 6 in Appendix A shows the percentage of unique targets per dataset, as well as the percentage of targets shared between the training set and the dev and test sets (we do not, however, consider partial overlap, which may exaggerate the true uniqueness of targets). Again, MPQA has the largest number of unique targets and the least overlap.

Table 1: Statistics of the datasets: number of sentences and average sentence length, plus count/average length/max length (in tokens) of holder, target, and expression annotations, and the distribution of polarity (+/neu/-).

Dataset (domain)              split  sents  len   holders      targets       expressions   pol (+/neu/-)
MPQA (newswire)               train  4500   25    1306/2.6/27  1382/6.1/56   1656/2.4/14   675/271/658
                              dev    1622   23    377/2.6/16   449/5.3/41    552/2.1/8     241/105/202
                              test   1681   24    371/2.8/32   405/6.4/42    479/2.0/8     166/89/199
DS. Services (service rev.)   train  5913   16    18/1.2/2     2504/1.2/7    1273/1.2/10   1623/46/838
                              dev    744    18    1/1.7/3      288/1.2/4     144/1.4/5     103/1/104
                              test   748    17    2/1/1        328/1.2/5     168/1.4/6     241/7/80
DS. Uni (university rev.)     train  2253   20    65/1.2/2     1252/1.2/5    837/1.9/9     495/149/610
                              dev    232    9     17/1.1/3     151/1.2/3     106/1.7/6     40/19/92
                              test   318    20    12/1.3/4     198/1.2/6     139/2.0/5     77/18/103
TDParse (political tweets)    train  2889   6.9   -            9088/1.2/7    -             1238/3931/3919
                              dev    321    6.6   -            1040/1.2/5    -             128/454/458
                              test   867    6.9   -            2746/1.2/6    -             378/1162/1206
SemEval R. (restaurant rev.)  train  2740   13    -            3293/1.4/19   -             1902/574/734
                              dev    304    11.3  -            350/1.4/5     -             226/54/63
                              test   800    9.6   -            1128/1.4/8    -             724/195/195
SemEval L. (laptop rev.)      train  2744   22.5  -            2049/1.5/6    -             870/402/747
                              dev    304    21.1  -            244/1.6/5     -             99/44/96
                              test   800    18.6  -            633/1.6/7     -             327/162/128
Open (tweets)                 train  1903   12.8  -            2594/1.6/8    -             578/1801/215
                              dev    211    12.3  -            291/1.6/6     -             46/220/25
                              test   234    11.6  -            337/1.6/7     -             74/232/31
Challenge (restaurant rev.)   train  4297   8.8   -            11186/1.3/9   -             3380/5042/2764
                              dev    500    8.9   -            1332/1.3/8    -             403/604/325
                              test   500    8.9   -            1336/1.3/8    -             400/607/329

4 Experiments

We split the task of targeted sentiment analysis into the extraction of sentiment targets and subsequent polarity classification of extracted targets, given their context. Figure 2 shows the two tasks and the eight models used in the experiments.

Figure 2: Our BERT-based target extraction and classification models, with the three strategies for extraction ((1) predict only targets, (2) predict holders, targets and expressions, and (3) additionally predict the polarity of the targets and expressions) and five strategies for sentiment classification (passing to the softmax layer (1) the [CLS] embedding, (2) the embedding of the first token in the target, (3) the average of all embeddings in the target phrase, (4) the max of the target embeddings, or (5) the concatenation of the max, mean, and min).
As a base model, we take the target extraction and classification models from Xu et al. (2019), which achieve state-of-the-art performance on the SemEval task. The approach first fine-tunes BERT (Devlin et al., 2019) on domain-specific unlabeled data as a domain-adaptation step. We use the datasets themselves to perform this step, except for the SemEval datasets. For these, we follow Rietzler et al. (2020) and instead use larger amounts of unlabeled data – 1,710,553 and 2,000,000 sentences for SemEval Laptop and Restaurant, respectively – taken from Amazon Laptop reviews (He and McAuley, 2016) and the Yelp Dataset Challenge. We further deviate from Xu et al. (2019) by not pretraining the models on the SQUAD question answering dataset and the in-domain sentiment questions which they create, as this data is not publicly available. Finally, a linear prediction layer is added after the BERT model and the full model is updated on the sentiment task.

For target extraction, we use the contextualized BERT embeddings as input to a softmax layer and predict the sequence of tags. We compare three prediction strategies (a sketch of the resulting tagsets follows the list):

1. Targ.: The model predicts the labels y ∈ {B, I, O} for the targets only.
2. Pred.: We additionally predict the labels for holders and expressions, i.e., y ∈ {B-holder, I-holder, B-target, I-target, B-expression, I-expression, O}.
3. +Pol.: Finally, we add the polarity (positive, negative, neutral) to the annotation-specific BIO tag, which leads to an inventory of 19 labels for the full fine-grained setup and 7 for the targeted setup.
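As a minimal sketch of the +Pol. strategy, the function below collapses the opinion spans from the JSON example above into BIO tags with the polarity appended; the exact label strings are our assumption, not necessarily those of the released code.

    def collapsed_tags(n_tokens, opinion):
        """BIO-encode one opinion with its polarity appended (+Pol.).

        With three polarities this gives 1 (O) + 3 elements x 3
        polarities x 2 (B/I) = 19 labels in the fine-grained setup;
        keeping only target spans gives the 7-label targeted setup.
        """
        tags = ["O"] * n_tokens
        pol = opinion["polarity"]          # positive / negative / neutral
        for element in ("holder", "target", "expression"):
            for start, end in opinion.get(element, []):  # end inclusive
                tags[start] = f"B-{element}-{pol}"
                for i in range(start + 1, end + 1):
                    tags[i] = f"I-{element}-{pol}"
        return tags

    # For the Pred. setup, drop the polarity suffix; for Targ.,
    # additionally keep only the target spans.

For the example sentence above, collapsed_tags(8, example_sentence["opinions"][0]) would tag "some others" as holder, "UMUC" as target, and "5 stars" as expression, all suffixed with "positive".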
For polarity classification, we take as a baseline the classification architecture from Xu et al. (2019), which makes use of the two-sentence training procedure for BERT, prepending the target before the sentence separation token and then adding the full sentence after. We compare five strategies for producing the input to the softmax layer that predicts the sentiment of the target (sketched in code after the list):

1. [CLS]: uses the [CLS] embedding from the final BERT layer.
2. First: uses the contextualized BERT embedding of the first token of the target in context.
3. Mean: instead takes the average of the BERT embeddings for the tokens in the target.
4. Max: uses the max of the contextualized BERT embeddings for the tokens in the target.
5. MaxMM: takes the max, min, and mean pooled representations and passes the concatenation to the softmax layer, which has been shown to perform well for sentiment tasks (Tang et al., 2014). However, this triples the size of the input representation to the softmax layer.
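A minimal PyTorch-style sketch of these five pooling strategies, assuming hidden holds the final-layer BERT states for one sentence and target_mask is a boolean tensor marking the target's tokens (both names are ours):

    import torch

    def target_representation(hidden, target_mask, strategy):
        # hidden: (seq_len, dim) final-layer BERT states
        # target_mask: (seq_len,) boolean mask over the target tokens
        target = hidden[target_mask]              # (n_target_tokens, dim)
        if strategy == "cls":
            return hidden[0]                      # [CLS] is the first token
        if strategy == "first":
            return target[0]                      # first target token
        if strategy == "mean":
            return target.mean(dim=0)
        if strategy == "max":
            return target.max(dim=0).values
        if strategy == "maxmm":                   # triples the input size
            return torch.cat([target.max(dim=0).values,
                              target.mean(dim=0),
                              target.min(dim=0).values])
        raise ValueError(f"unknown strategy: {strategy}")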
The Targ. and [CLS] models correspond to the models used in Xu et al. (2019) and serve as baselines. The extraction and classification models are fine-tuned for 50 epochs using Adam, with a linear warmup of 0.1; all other hyperparameters are left at the default BERT settings (further details in Appendix B). The best model on the development set is used for testing. Combined with the four input manipulations (Table 2), this leads to eleven extraction experiments – Targ. and Pred. on the original data, which only has annotated targets, are the same, so for simplicity we only show the results from Targ. – and twenty classification experiments per dataset. In order to control for the effect of random initialization, we run each experiment 5 times with different random seeds and report the mean and standard deviation.
Given that we are interested in knowing whether it is beneficial to include information about additional annotations (holders, expressions, polarity), we perform experiments where we systematically include these. We do so by adding special tags, e.g., <E> and </E>, into the input text surrounding the annotated spans, as shown in Table 2 (a code sketch of this tagging follows below). The models then have access to this information both during training and at test time, albeit in an indirect way. For the first set of experiments, we perform controlled experiments under ideal conditions, i.e., having gold annotations during testing. This allows us to isolate the effects of incorporating the additional annotations, without worrying about noisy predictions.

Table 2: We inform our models about annotations other than targets by inserting special tags into the input text before and after annotated holders and expressions.

original       Money Magazine rated E-Trade highly.
+ holders      <H> Money Magazine </H> rated E-Trade highly.
+ expressions  Money Magazine <E> rated </E> E-Trade <E> highly </E>.
+ full         <H> Money Magazine </H> <E> rated </E> E-Trade <E> highly </E>.
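The tag-insertion step of Table 2 can be sketched as follows; the tag strings and the inclusive (start, end) span convention are assumptions for illustration.

    def add_special_tags(tokens, spans, open_tag, close_tag):
        """Surround annotated token spans with special tags (Table 2)."""
        out = list(tokens)
        # Insert from the rightmost span first so that earlier indices
        # stay valid; spans are assumed not to overlap.
        for start, end in sorted(spans, reverse=True):
            out.insert(end + 1, close_tag)
            out.insert(start, open_tag)
        return out

    tokens = ["Money", "Magazine", "rated", "E-Trade", "highly", "."]
    # "+ expressions": tag the polar expressions "rated" and "highly"
    print(add_special_tags(tokens, [(2, 2), (4, 4)], "<E>", "</E>"))
    # ['Money', 'Magazine', '<E>', 'rated', '</E>', 'E-Trade',
    #  '<E>', 'highly', '</E>', '.']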
It is equally important to know whether the models are able to use noisy predicted annotations. In order to test this, we train expression prediction models on the three full fine-grained sentiment corpora. We use the same BERT-based model and hyperparameters as for the target extraction models above and train five models with different random seeds. Preliminary results suggested that these models had high precision, but low recall. Therefore, we take a simple ensemble of the five trained models, where for each token we keep labels predicted by at least one of the expression models, in order to increase recall.

We perform an additional set of experiments where we use sentiment lexicons and assume any word in these lexicons is a sentiment expression. We use the Hu and Liu lexicon (Hu and Liu, 2004), the SoCal and SoCal-Google lexicons (Taboada et al., 2006), and the NRC emotion lexicon (Mohammad and Turney, 2013), which also contains sentiment annotations. The lexicons contain 6,789, 5,824, 2,142, and 5,474 entries, respectively. The MPQA and Darmstadt experiments show the effect of predicted vs. gold expressions, as well as domain transfer. The experiments on the targeted datasets, on the other hand, will show us whether it is possible to improve the targeted models with predicted expressions.
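Under this lexicon setup, expression spans could be derived as simply as the following sketch (reusing add_special_tags from above; the toy lexicon is a stand-in, not actual Hu and Liu entries):

    def lexicon_expression_spans(tokens, lexicon):
        # Treat every token found in the sentiment lexicon as a
        # single-token sentiment expression.
        return [(i, i) for i, tok in enumerate(tokens)
                if tok.lower() in lexicon]

    toy_lexicon = {"highly", "great", "terrible"}    # stand-in entries
    tokens = ["Money", "Magazine", "rated", "E-Trade", "highly", "."]
    spans = lexicon_expression_spans(tokens, toy_lexicon)   # [(4, 4)]
    tagged = add_special_tags(tokens, spans, "<E>", "</E>")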
5 Results

In this section we describe the main results from the extraction and the two classification experiments described in Section 4.

Table 3 shows the results for the extraction experiment, where token-level F1 is measured only on targets. The models perform worse than the state of the art, as we did not fine-tune on the SQUAD question answering dataset and in-domain sentiment questions or perform extensive hyperparameter tuning. The average F1 score depends highly on the dataset – MPQA is the most difficult dataset with 13.1 F1 on the original data, while the Darmstadt Universities corpus is the easiest for target extraction with 84.6. Augmenting the input text with further annotations, but predicting only sentiment targets (Targ. in Table 3), hurts model performance in all cases. Specifically, adding holder tags leads to an average drop of 1.3 percentage points (pp), expressions 1.2, and full 1.5.
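The token-level target F1 reported in Table 3 can be sketched as follows; this is our reading of the metric, and the released evaluation script may differ in details.

    def token_f1(gold, pred, element="target"):
        # gold, pred: flat BIO tag sequences; count token-level hits
        # on tags belonging to the given element type only.
        def hit(tag):
            return tag != "O" and element in tag
        tp = sum(hit(g) and hit(p) for g, p in zip(gold, pred))
        fp = sum(not hit(g) and hit(p) for g, p in zip(gold, pred))
        fn = sum(hit(g) and not hit(p) for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0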
Attempting to additionally predict these annotations (Pred. in Table 3) leads to mixed results – the model improves on MPQA + exp. and Darmstadt Services + holders, shows no notable difference on MPQA + full and Darmstadt Universities + exp., and loses on the rest.

Adding the polarity to the target BIO tags (original +Pol. in Table 3) leads to the most consistent improvements across experiments – an average of 0.5 pp – with the largest improvement of 1.5 pp on the TDParse dataset. This suggests a weak-to-moderate relationship between polarity and extraction, which contradicts previous conclusions (Hu et al., 2019). Finally, further adding the holder and expression tags (+Pol. in Table 3) tends to decrease performance.

[Table 3: per-cell F1 values omitted (not recoverable from the source). Rows: Xu et al. (2019) (78.0 on SemEval R., 84.3 on SemEval L.), a BiLSTM-CRF baseline, and the BERT models on the original input and with + holders / + exp. / + full augmentation, under the Targ., Pred., and +Pol. strategies. Columns: MPQA, DS. Services, DS. Unis, Challenge, SemEval R., SemEval L., Open, TDParse.]

Table 3: Average token-level F1 scores for the target extraction task across five runs (standard deviation in parentheses). Bold numbers indicate the best model per dataset, while blue and pink highlighting indicate an improvement or loss in performance compared to the original data, respectively.

Table 4 shows the macro F1 scores for the polarity classification task on the gold targets. The model performs better than the best reported results on Challenge (Jiang et al., 2019), and similar to previous results on the SemEval corpora. Regarding the choice of target representation, First is the strongest overall, with an average of 64.7 F1 across the original eight datasets, followed by Max (64.6), Mean (64.4), MaxMM (64.2), and finally [CLS] (64.1). It is, however, unclear exactly which representation is the best, as it differs for each dataset. But we can conclude that [CLS] is in general the weakest model, while either First or Max provides a good starting point.

Adding holder annotations to the input text delivers only small improvements on four of the fifteen experiments, and has losses on seven. The + exp. model, however, leads to significant improvements on 10 experiments. The outlier seems to be Darmstadt Services, which contains a large number of "polar targets" – targets with no polar expressions – which may explain why including this information has less effect on this dataset. Finally, + full performs between the original input and + exp.

[Table 4: per-cell macro F1 values omitted (not recoverable from the source). Rows: original, + holders, + exp., + full for each of the [CLS], First, Mean, Max, and MaxMM target representations, plus previous results (70.3 on Challenge, 80.1 on SemEval R., 78.3 on SemEval L.). Columns as in Table 3.]

Table 4: Average macro F1 scores for polarity classification across five runs (standard deviation in parentheses) on gold targets, also adding information about holders and expressions. Bold indicates the best model per dataset, while blue and pink highlighting indicate an improvement or loss in performance compared to the original (targets only) data, respectively.
The expression models achieve modest F1 scores when trained and tested on the same dataset – between 15.0 and 47.9 – and poor scores when transferred to a different dataset – between 0.9 and 14.9 (further details in Table 7 in Appendix A). The lexicons often provide better cross-dataset F1 than the expression models trained on another dataset, as they have relatively good precision on general sentiment terms.

Figure 3 shows the improvements and losses on the eight datasets when augmenting the input text with expression tags from the expression models and lexicons. We compare the expression-augmented results to the original results for each pooling technique and take the average of these improvements and losses. For a full table of all results, see Table 5 in Appendix A.

Augmenting the input text with predicted sentiment expressions leads to losses in 41 out of the 56 averaged experiments shown in Figure 3 (or in 173 out of 280 experiments in Table 5). Curiously, the experiments that use an expression model trained on the same dataset as the classification task, e.g., MPQA predicted expressions on the MPQA classification task, have the largest losses – the largest of which is MPQA (-2.78 on average). This seems to indicate that the mismatch between the train predictions, which are near perfect, and the rather poor test predictions is more problematic than cross-dataset predictions, which are similar on train and test.

The best expression prediction model is the one trained on MPQA, improving the performance on Darmstadt Universities, Open, and SemEval Restaurants. This is likely due to the fact that MPQA has the largest number of annotated expressions, and that the domain is more general, leading to expression predictions that generalize better. The expression model trained on Darmstadt Services leads to small benefits on two corpora, and the expression model trained on Darmstadt Universities only leads to losses.

The datasets that receive the most benefit from expression annotations are Darmstadt Universities (6/7 experiments) and TDParse (5/7). In both cases, the lexicon-based expression models provide more consistent benefits than the trained expression prediction models. The fact that the dataset that benefits most is TDParse suggests that expression information is most useful when there are multiple targets with multiple polarities.

There is no significant correlation between the performance of the expression prediction model and the performance on the classification task on the three fine-grained datasets.
In fact, there is a small but insignificant negative correlation (a Pearson's correlation of -0.33 between the expression performances and the F1 of the classification models augmented with these predicted expressions). It seems that the possible benefits depend more on the target dataset than on the actual expression model used.

Figure 3: Average improvements (positive values) and losses (negative values) on the target classification tasks (columns) when augmenting the input text with predicted sentiment expressions from the expression prediction models and lexicons (rows).

              MPQA   DS Serv.  DS Unis  Challenge  Open   SemEval R.  SemEval L.  TDParse
MPQA         -2.78   -0.24      0.52    -0.52       1.44   0.24       -0.14       -0.90
DS Services  -0.50   -1.56      0.12    -1.54      -0.18  -0.22       -0.36        0.64
DS Unis      -0.50   -0.44     -1.12    -1.32      -0.42  -1.46       -1.04       -0.90
HuLiu        -1.82   -1.34      0.98    -1.22      -2.34  -0.72       -0.14        0.48
NRC           0.18   -0.42      2.02    -1.02      -1.32  -1.06        0.28        1.08
SoCal        -0.86   -1.44      0.30    -1.02      -2.20  -3.52       -1.82        1.46
SoCal-G      -0.70   -0.86      0.28    -1.18      -2.10  -1.74       -0.38        0.58

6 Conclusion

In this work we have explored the benefit of augmenting targeted sentiment models with holder and sentiment expressions. The experiments have shown that although augmenting text with holder and expression tags (RQ1(a)) or simultaneously predicting them (RQ1(b)) has no benefit for target extraction, predicting collapsed BIO + polarity tags consistently improves target extraction (RQ1(c)). Furthermore, augmenting the input text with gold expressions generally improves targeted polarity classification (RQ2(a)), although it is not clear which target representation strategy is best (RQ2(b)). We have also found benefits of including lexicon-based expressions for the more complex targeted datasets (RQ2(c)).

The rather poor performance of the learned expression models and the difference between augmenting with gold or predicted expressions reveal the need to improve expression prediction approaches, both by creating larger corpora annotated with sentiment expressions and by performing further research on the modelling aspect. Any future work interested in modelling more complex sentiment phenomena should therefore be aware that we may first require more high-quality annotated data if we wish to do so with current state-of-the-art machine learning approaches.

Furthermore, we introduce a common format for eight standard English datasets in fine-grained sentiment analysis and release the scripts to download and preprocess them easily. We plan to include further datasets in our scripts in the future, as well as extending our work to other languages with available fine-grained corpora.
References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David Etter, Francis Ferraro, Ryan Cotterell, Olivia Buzek, and Benjamin Van Durme. 2013. Nerit: Named Entity Recognition for Informal Text. Technical Report 11, Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, Maryland.

Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2019. An interactive multi-task learning network for end-to-end aspect-based sentiment analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 504-515, Florence, Italy. Association for Computational Linguistics.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 507-517, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.

Minghao Hu, Yuxing Peng, Zhen Huang, Dongsheng Li, and Yiwei Lv. 2019. Open-domain targeted sentiment analysis via span-based extraction and classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 537-546, Florence, Italy. Association for Computational Linguistics.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-177, Seattle, USA.

Qingnan Jiang, Lei Chen, Ruifeng Xu, Xiang Ao, and Min Yang. 2019. A challenge dataset and effective models for aspect-based sentiment analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6279-6284, Hong Kong, China. Association for Computational Linguistics.

Arzoo Katiyar and Claire Cardie. 2016. Investigating LSTMs for joint extraction of opinion entities and relations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 919-929, Berlin, Germany. Association for Computational Linguistics.

Roman Klinger and Philipp Cimiano. 2013. Bi-directional inter-dependencies of subjective expressions and targets and their value for a joint model. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 848-854, Sofia, Bulgaria. Association for Computational Linguistics.

Xin Li, Lidong Bing, Piji Li, and Wai Lam. 2019. A unified model for opinion target extraction and target sentiment prediction. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pages 6714-6721, Honolulu, Hawaii. AAAI Press.

Bing Liu. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press, Cambridge, United Kingdom.

Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, pages 63-70.

Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, and Benjamin Van Durme. 2013. Open domain targeted sentiment. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1643-1654, Seattle, Washington, USA. Association for Computational Linguistics.

Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelligence, 29(3):436-465.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27-35, Dublin, Ireland. Association for Computational Linguistics.

Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2020. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4933-4941, Marseille, France. European Language Resources Association.

Maite Taboada, Caroline Anthony, and Kimberly Voll. 2006. Methods for Creating Semantic Orientation Dictionaries. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555-1565, Baltimore, Maryland. Association for Computational Linguistics.

Cigdem Toprak, Niklas Jakob, and Iryna Gurevych. 2010. Sentence and expression level annotation of opinions in user-generated discourse. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 575-584, Uppsala, Sweden. Association for Computational Linguistics.

Bo Wang, Maria Liakata, Arkaitz Zubiaga, and Rob Procter. 2017. TDParse: Multi-target-specific sentiment recognition on twitter. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 483-493, Valencia, Spain. Association for Computational Linguistics.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation (formerly Computers and the Humanities), 39(2/3):164-210.

Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2324-2335, Minneapolis, Minnesota. Association for Computational Linguistics.

Meishan Zhang, Yue Zhang, and Duy-Tin Vo. 2015. Neural networks for open domain targeted sentiment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 612-621, Lisbon, Portugal. Association for Computational Linguistics.
Appendix A Additional tables
[Table 5: per-cell macro F1 values omitted (not recoverable from the source). Rows: the [CLS], First, Mean, Max, and MaxMM representations under the original input and under expression tags from the MPQA, DS. Services, and DS. Unis prediction models and the HuLiu, NRC, SoCal, and SoCal-Google lexicons. Columns: MPQA, DS. Services, DS. Unis, Challenge, Open, SemEval R., SemEval L., TDParse.]

Table 5: Macro F1 scores for polarity classification of gold targets. Bold numbers indicate the best model per dataset, while blue and pink highlighting indicate an improvement or loss in performance compared to the original data (gold targets only), respectively.
Table 6: Analysis of targets in the datasets. % Unique describes the percentage of targets that are found only in that split. % Overlap describes the percentage of dev/test targets that are found in the train set. We disregard partial matches, e.g., "chinese food" and "food".

              % Unique                % Overlap
              train   dev    test     train-dev   train-test
MPQA          ...     ...    ...      15          19
DS. Services  36.2    48.6   47.5     45.0        35.6
DS. Uni       35.2    52.9   45.0     ...         ...
Table 7: Token-level macro F1 scores for expression prediction models (trained) and lexicon expressions (lexicons) when tested on the three fine-grained datasets (columns). The trained model scores are the average and standard deviation across five runs with different random seeds. The lexicon models are deterministic and therefore only have a single score.

                        MPQA        DS. Services  DS. Unis
trained   MPQA          15.0 (1.7)  ... (0.8)     ... (1.2)
          DS. Services   0.9 (0.3)  ... (7.3)     ... (1.2)
          DS. Unis       1.4 (0.6)  ... (1.5)     ... (1.5)
lexicons  HuLiu          4.7        17.9          16.0
          NRC            3.3         7.4           9.0
          SoCal          2.4        13.2          13.8
          SoCal Google   1.0        13.2          11.4

Appendix B Training details

GPU Infrastructure: ...
CPU Infrastructure: Intel Xeon-Gold 6126 2.6 GHz
Number of search trials: ...
Domain training duration: ...
Extraction fine-tuning duration: ...
Classification fine-tuning duration: ...
Model implementation: https://github.com/blinded/for/review