Re-Evaluating GermEval17 Using German Pre-Trained Language Models
Matthias Aßenmacher, Alessandra Corvonato, Christian Heumann
Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany
{matthias,chris}@stat.uni-muenchen.de, [email protected]

Abstract
The lack of a commonly used benchmark data set (collection) such as (Super-)GLUE (Wang et al., 2018, 2019) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP research. It concentrates a large part of the research on English, neglecting the uncertainty when transferring conclusions found for the English language to other languages. We evaluate the performance of the German and multilingual BERT-based models currently available via the huggingface transformers library on the four tasks of the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017; Schmitt et al., 2018; Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for similar tasks and similar models (pre-BERT vs. BERT-based) for the English language in order to draw tentative conclusions about whether the observed improvements are transferable to German or potentially other related languages.
Introduction

Sentiment Analysis is used to transform reviews into helpful information on how a product or service of a company is perceived among the customers. Until recently, Sentiment Analysis was mainly conducted using traditional machine learning and recurrent neural networks, like LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Cho et al., 2014). Those models have been practically replaced with language models relying on (parts of) the Transformer architecture, a novel framework proposed by Vaswani et al. (2017). Devlin et al. (2019) developed a Transformer-encoder-based language model called BERT (Bidirectional Encoder Representations from Transformers), achieving state-of-the-art (SOTA) performance on several benchmark tasks - mainly for the English language - and becoming a milestone in the field of NLP.

Up to now, only a few researchers have focused on sentiment-related problems for German reviews. The first shared task on German Sentiment Analysis, which provides a large annotated data set for training and evaluation, is the GermEval17 Shared Task (Wojatzki et al., 2017). The participating teams mostly analyzed the data using standard machine learning techniques such as SVMs, CRFs, or LSTMs. In contrast to 2017, today, different pre-trained BERT models are available for a variety of languages, including German. We re-analyzed the complete GermEval17 Task using seven pre-trained BERT models suitable for German provided by the huggingface transformers library (Wolf et al., 2020). We evaluate which of the models is best suited for the different GermEval17 subtasks by comparing their performance values. Furthermore, we compare our findings on whether (and how much) BERT-based models are able to improve the pre-BERT SOTA in German Sentiment Analysis with the SOTA developments for English Sentiment Analysis by the example of SemEval-2014 (Pontiki et al., 2014).

We first give an overview of the GermEval17 tasks (cf. Sec. 2) and of related work (cf. Sec. 3). Second, we present the data and the models (cf. Sec. 4), while Section 5 presents the results of our re-evaluation. Sections 6 and 7 conclude our work by stating our main findings and drawing parallels to the English language.
The GermEval17 Task(s)
The GermEval17 Shared Task (Wojatzki et al., 2017) is a task on analyzing aspect-based sentiments in customer reviews about "Deutsche Bahn" (DB) - the German public train company. The main data was crawled from various social media platforms such as Twitter, Facebook and Q&A websites from May 2015 to June 2016. The documents were manually annotated and split into a training (train), a development (dev) and a synchronic (test_syn) test set. A diachronic test set (test_dia) was collected the same way from November 2016 to January 2017 in order to test for robustness. The task comprises four subtasks representing a complete classification pipeline. Subtask A is a binary Relevance Classification task which aims at identifying whether the feedback refers to DB. Subtask B aims at classifying the Document-level Polarity ("negative", "positive" and "neutral"). In Subtask C, the model has to identify all the aspect categories with associated sentiment polarities in a relevant document. This multi-label classification task was divided into Subtask C1 (Aspect-only) and Subtask C2 (Aspect+Sentiment). For this purpose, the organizers defined 20 different aspect categories, e.g. Allgemein (General) or Sonstige Unregelmäßigkeiten (Other irregularities). Finally, Subtask D refers to the Opinion Target Extraction (OTE), i.e. a sequence labeling task extracting the linguistic phrase used to express an opinion. We differentiate between exact match (Subtask D1) and overlapping match, tolerating errors of +/- one token (Subtask D2).

Related Work

Already before BERT, many researchers focused on (English) Sentiment Analysis (Behdenna et al., 2018). The most common architectures were traditional machine learning classifiers and recurrent neural networks (RNNs). SemEval14 (Task 4; Pontiki et al., 2014) was the first workshop to introduce Aspect-based Sentiment Analysis (ABSA), which was expanded within SemEval15 Task 12 (Pontiki et al., 2015) and SemEval16 Task 5 (Pontiki et al., 2016). Here, restaurant and laptop reviews were examined at different granularities. The best model at SemEval16 was an SVM/CRF architecture using GloVe embeddings (Pennington et al., 2014). However, many works recently focused on re-evaluating the SemEval Sentiment Analysis tasks using BERT-based language models (Hoang et al., 2019; Xu et al., 2019; Sun et al., 2019; Li et al., 2019; Karimi et al., 2020; Tao and Fang, 2020).

In comparison, little research deals with German Sentiment Analysis. For instance, Barriere and Balahur (2020) trained a multilingual BERT model for German document-level Sentiment Analysis on the SB-10k data set (Cieliebak et al., 2017). Regarding the GermEval17 Subtask B, Guhr et al. (2020) considered both FastText (Bojanowski et al., 2017) and BERT, achieving notable improvements. Biesialska et al. (2020) made use of ensemble models. One is an ensemble of ELMo (Peters et al., 2018), GloVe and a bi-attentive classification network (BCN; McCann et al., 2017), achieving a score of 0.782, and the other one consists of ELMo and a Transformer-based Sentiment Analysis model (TSA), reaching a score of 0.789 on the synchronic test data set. Moreover, Attia et al. (2018) trained a convolutional neural network (CNN), achieving a score of 0.7545 on the synchronic test data set. Schmitt et al. (2018) advanced the SOTA for Subtask C. They experimented with biLSTMs and CNNs to carry out end-to-end Aspect-based Sentiment Analysis. The highest score was achieved using an end-to-end CNN architecture with FastText embeddings, scoring 0.523 and 0.557 on the synchronic and diachronic test data set for Subtask C1, respectively, and 0.423 and 0.465 for Subtask C2.
Data
The GermEval17 data is freely available in .xml- and .tsv-format; the data sets (in both formats) can be obtained from http://ltdata1.informatik.uni-hamburg.de/germeval2017/. Each data split (train, validation, test) in .tsv-format contains the following variables:

• document id (URL)
• document text
• relevance label (true, false)
• document-level sentiment label (negative, neutral, positive)
• aspects with respective polarities (e.g. Ticketkauf)

For documents which are annotated as irrelevant, the sentiment label is set to neutral and no aspects are available. Visibly, the .tsv-formatted data does not contain the target expressions or their associated sequence positions. Consequently, Subtask D can only be conducted using the data in .xml-format, which additionally holds the information on the starting and ending sequence positions of the target phrases.

The data set comprises ∼26k documents in total, including the diachronic test set with around 1.8k examples. Further, the main data was randomly split by the organizers into a train data set for training, a development data set for evaluation and a synchronic test data set. Table 1 displays the number of documents for each split.

Split | train | dev | test_syn | test_dia
Documents | 19,432 | 2,369 | 2,566 | 1,842

Table 1: Number of documents per split of the data set.
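The .tsv layout described above can be read with a few lines of standard-library Python. The sample row below is hypothetical and only mimics the five columns; the real files use the same tab-separated structure.

```python
import csv
import io

# Hypothetical sample row mimicking the five .tsv columns described above:
# document id (URL), document text, relevance label, document-level
# sentiment label, and aspect:polarity annotations.
sample = (
    "http://example.com/doc1\t"
    "Der Zug ist schon wieder verspätet.\t"
    "true\tnegative\tZugfahrt:negative"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
doc_id, text, relevance, sentiment, aspects = next(reader)

# Each aspect annotation is a "category:polarity" pair.
aspect_pairs = [a.rsplit(":", 1) for a in aspects.split()]
print(doc_id, relevance, sentiment, aspect_pairs)
```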
While roughly 74% of the documents form the train set, the development split and the synchronic test split contain around 9% and 10%, respectively. The remaining 7% of the data belong to the diachronic set (cf. Tab. 1). Table 2 shows the relevance distribution per data split. This unveils a pretty skewed label distribution, since relevant documents represent the clear majority with over 80% in each split.
Relevance | train | dev | test_syn | test_dia
true | 16,201 | 1,931 | 2,095 | 1,547
false | 3,231 | 438 | 471 | 295

Table 2: Relevance distribution for Subtask A.
The distribution of the sentiments is depicted in Table 3, which shows that between 65% and 69% (per split) belong to the neutral class, 25–31% to the negative and only 4–6% to the positive class.

Sentiment | train | dev | test_syn | test_dia
negative | 5,045 | 589 | 780 | 497
neutral | 13,208 | 1,632 | 1,681 | 1,237
positive | 1,179 | 148 | 105 | 108

Table 3: Sentiment distribution for Subtask B.

Table 4 holds the distribution of the 20 different aspect categories assigned to the documents (multiple annotations per document are possible; for a detailed category description see https://sites.google.com/view/germeval2017-absa/data). It shows the number of documents containing certain categories without differentiating between how often a category appears within a given document. The relative distribution of the aspect categories is similar between the splits.

Category | train | dev | test_syn | test_dia
Allgemein | 11,454 | 1,391 | 1,398 | 1,024
Zugfahrt | 1,687 | 177 | 241 | 184
Sonstige Unregelmäßigkeiten | 1,277 | 139 | 224 | 164
Atmosphäre | 990 | 128 | 148 | 53
Ticketkauf | 540 | 64 | 95 | 48
Service und Kundenbetreuung | 447 | 42 | 63 | 27
Sicherheit | 405 | 59 | 84 | 42
Informationen | 306 | 28 | 58 | 35
Connectivity | 250 | 22 | 36 | 73
Auslastung und Platzangebot | 231 | 25 | 35 | 20
DB App und Website | 175 | 20 | 28 | 18
Komfort und Ausstattung | 125 | 18 | 24 | 11
Barrierefreiheit | 53 | 14 | 9 | 2
Image | 42 | 6 | 0 | 3
Toiletten | 41 | 5 | 7 | 4
Gastronomisches Angebot | 38 | 2 | 3 | 3
Reisen mit Kindern | 35 | 3 | 7 | 2
Design | 29 | 3 | 4 | 2
Gepäck | 12 | 2 | 2 | 6
QR-Code | 0 | 1 | 1 | 0
total | 18,137 | 2,149 | 2,467 | 1,721
∅ different aspects/document | 1.12 | 1.11 | 1.18 | 1.11

Table 4: Aspect category distribution for Subtask C. Multiple mentions of the same aspect category in a document are only considered once.

On average, there are ∼1.1 different aspects per document. Again, the label distribution is heavily skewed, with Allgemein (General) clearly representing the majority class, as it is present in 75.8% of the documents with aspects. The second most frequent category is Zugfahrt (Train ride), appearing in around 13.8% of the documents. This strong imbalance in the aspect categories leads to an almost Zipfian distribution (Wojatzki et al., 2017).
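The skew can be made concrete with the train-split counts from Table 4. The short script below (plain Python, counts copied from the table) prints how quickly the share of aspect annotations drops with rank.

```python
# Train-split counts of the six most frequent aspect categories (Table 4);
# 18,137 is the total number of aspect annotations in the train split.
counts = {
    "Allgemein": 11454,
    "Zugfahrt": 1687,
    "Sonstige Unregelmäßigkeiten": 1277,
    "Atmosphäre": 990,
    "Ticketkauf": 540,
    "Service und Kundenbetreuung": 447,
}
total = 18137

ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (category, n) in enumerate(ranked, start=1):
    print(f"{rank}. {category}: {n} ({n / total:.1%} of all annotations)")
```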
Pre-trained architectures
Since we were interested in examining the performance differences between pre-BERT architectures and current SOTA models, we took pre-trained German BERT-based architectures available via the huggingface transformers module (Wolf et al., 2020). All the models we selected for examination are displayed in Table 5: three German (Distil)BERT models by DBMDZ and one by Deepset.ai (see https://deepset.ai/german-bert for details). The latter one is pre-trained using German Wikipedia (6GB raw text files), the Open Legal Data dump (2.4GB; Ostendorff et al., 2020) and news articles (3.6GB). DBMDZ combine Wikipedia, EU Bookshop (Skadiņš et al., 2014), OpenSubtitles (Lison and Tiedemann, 2016), CommonCrawl (Ortiz Suárez et al., 2019), ParaCrawl (Esplà-Gomis et al., 2019) and News Crawl (Haddow, 2018) into a corpus with a total size of 16GB and ∼2,350M tokens. Besides this, we use the three multilingual models included in the transformers module. This amounts to five BERT and two DistilBERT models, two of which are "uncased" (i.e. lower-cased) while the other five models are "cased".

Model variant | corpus | properties
bert-base-german-cased | see text | L=12, H=768, A=12
bert-base-german-dbmdz-cased | see text | L=12, H=768, A=12
bert-base-german-dbmdz-uncased | see text | L=12, H=768, A=12
bert-base-multilingual-cased | Largest Wikipedias (top 104 languages) | L=12, H=768, A=12, 179M parameters
bert-base-multilingual-uncased | Largest Wikipedias (top 102 languages) | L=12, H=768, A=12, 168M parameters
distilbert-base-german-cased | see text | L=6, H=768, A=12
distilbert-base-multilingual-cased | Largest Wikipedias (top 104 languages) | L=6, H=768, A=12, 134M parameters

Table 5: Pre-trained models provided by huggingface transformers (version 4.0.1) suitable for German. For all available models, see: https://huggingface.co/transformers/pretrained_models.html.
For the re-evaluation, we used the latest data provided in .xml-format. Duplicates were not removed in order to keep our results as comparable as possible. We tokenized the documents and fixed single spelling mistakes in the labels ("positve" in the train set was replaced with "positive"; an inconsistently spelled "negative" label in test_dia was normalized). For Subtask D, the BIO tags were added based on the provided sequence positions, i.e. one entity corresponds to at least one token tag, starting with B- for "Beginning" and continuing with I- for "Inner". If a token does not belong to any entity, the tag O for "Outer" is assigned. For instance, the sequence "fährt nicht" (engl. "does not run") consists of two tokens and would receive the entity Zugfahrt:negative and the token tags [B-Zugfahrt:negative, I-Zugfahrt:negative] if it refers to a DB train which is not running.

The models were fine-tuned on one Tesla V100 PCIe 16GB GPU using Python 3.8.7, the transformers module (version 4.0.1) and torch (version 1.7.1). The considered hyperparameter values for fine-tuning follow the recommendations of Devlin et al. (2019), although, due to memory limitations, not every hyperparameter combination was applicable:

• Batch size ∈ {16, 32},
• Adam learning rate ∈ {2e-5, 3e-5, 5e-5},
• Number of epochs ∈ {2, 3, 4}.

Finally, all the language models were fine-tuned with a learning rate of 5e-5 and four epochs. The maximum sequence length was set to 256 and a batch size of 32 was chosen.

Other models
Eight teams officially participated in GermEval17, of which five teams analyzed Subtask A, all of them evaluated Subtask B, and two teams submitted results for both Subtask C and D. Since GermEval17 was conducted back in 2017, several other authors have also focused on analyzing parts of the subtasks. Table 6 shows which authors employed which kinds of models to solve which task. We added the system by Ruppert et al. (2017) to the models from 2017, even if they did not officially participate, because the system was constructed and released by the organizers. They also tackled all the subtasks. The performance values of these models are added to our tables where suitable.

Source code is available on GitHub: https://github.com/ac74/reevaluating_germeval2017. The results are fully reproducible for Subtasks A, B and C. For Subtask D, the results should also be reproducible, but for some reason they are not, and we could not yet fix the issue. However, the micro F1 scores only fluctuate by +/-0.01.

Subtask | A | B | C1 | C2 | D1 | D2
Models 2017 (Wojatzki et al., 2017; Ruppert et al., 2017) | X | X | X | X | X | X
Our BERT models | X | X | X | X | X | X
CNN (Attia et al., 2018) | – | X | – | – | – | –
CNN+FastText (Schmitt et al., 2018) | – | – | X | X | – | –
ELMo+GloVe+BCN (Biesialska et al., 2020) | – | X | – | – | – | –
ELMo+TSA (Biesialska et al., 2020) | – | X | – | – | – | –
FastText (Guhr et al., 2020) | – | X | – | – | – | –
bert-base-german-cased (Guhr et al., 2020) | – | X | – | – | – | –

Table 6: Different subtasks which the models were (not) evaluated on.

Results
Subtask A
The Relevance Classification is a binary document classification task with classes true and false. Table 7 displays the micro F1 score obtained by each language model on each test set (best result per data set in bold).
Language model | test_syn | test_dia
Best model 2017 (Sayyed et al., 2017) | 0.903 | 0.906
bert-base-german-cased | |
bert-base-german-dbmdz-cased | |
bert-base-german-dbmdz-uncased | 0.957 | 0.948
bert-base-multilingual-cased | |
bert-base-multilingual-uncased | |
distilbert-base-german-cased | |
distilbert-base-multilingual-cased | |

Table 7: F1 scores for Subtask A on synchronic and diachronic test sets.
All the models outperform the best result achieved in 2017 for both test data sets. For the synchronic test set, the previous best result is surpassed by 3.8–5.4 percentage points. For the diachronic test set, the absolute difference to the best contender of 2017 varies between 2.6 and 4.2 percentage points. With a micro F1 score of 0.957 and 0.948, respectively, the best scoring pre-trained language model is the uncased German BERT-BASE variant by dbmdz, followed by its cased version. All the pre-trained models perform slightly better on the synchronic test data than on the diachronic data. Attia et al. (2018), Schmitt et al. (2018), Biesialska et al. (2020) and Guhr et al. (2020) did not evaluate their models on this task.
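All scores in this and the following sections are micro-averaged F1 values. As a reminder of what this metric does, here is a minimal sketch with made-up labels (not the official evaluation script): true positives, false positives and false negatives are pooled over all classes before computing precision and recall, which for single-label classification reduces to plain accuracy.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for single-label classification: TP/FP/FN are
    pooled over all classes; every misclassification counts as one FP
    (for the predicted class) and one FN (for the gold class)."""
    tp = sum(g == p for g, p in zip(gold, pred))
    fp = sum(g != p for g, p in zip(gold, pred))
    fn = fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative relevance labels for five documents (Subtask A style).
gold = ["true", "true", "false", "true", "false"]
pred = ["true", "false", "false", "true", "false"]
print(round(micro_f1(gold, pred), 3))  # 0.8, equal to accuracy here
```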
Subtask B
Subtask B refers to the Document-level Polarity, which is a multi-class classification task with three classes. Table 8 shows the performances on the two test sets.

Language model | test_syn | test_dia
Best models 2017 (test_syn: Ruppert et al., 2017; test_dia: Sayyed et al., 2017) | 0.767 | 0.750
bert-base-german-cased | |
bert-base-german-dbmdz-cased | |
bert-base-german-dbmdz-uncased | 0.807 | 0.800
bert-base-multilingual-cased | |
bert-base-multilingual-uncased | |
distilbert-base-german-cased | |
distilbert-base-multilingual-cased | |
bert-base-german-cased (Guhr et al., 2020)† | 0.789 | –

Table 8: Micro-averaged F1 scores for Subtask B on synchronic and diachronic test sets. † Guhr et al. (2020) created their own (balanced & unbalanced) data splits, which limits comparability. We compare to the performance on the unbalanced data since it more likely resembles the original data splits.

All models outperform the best model from 2017 by 1.0–4.0 percentage points on the synchronic, and by 1.6–5.0 percentage points on the diachronic test set. On the synchronic test set, the uncased German BERT-BASE model by dbmdz performs best with a score of 0.807, followed by its cased variant with 0.799. For the diachronic test set, the uncased German BERT-BASE model exceeds the other models with a score of 0.800, followed by the cased German BERT-BASE model reaching a score of 0.793. The three multilingual models generally perform worse than the German models on this task. Besides this, all the models perform slightly better on the synchronic data set than on the diachronic one. The FastText-based model (Guhr et al., 2020) does not even come close to the baseline from 2017, while the ELMo-based models (Biesialska et al., 2020) are quite competitive. Interestingly, two of the multilingual models are even outperformed by these ELMo-based models.
Subtask C
Subtask C is split into Aspect-only (Subtask C1) and Aspect+Sentiment (Subtask C2) classification, each being a multi-label classification task (this leads to a change of activation function in the final layer from softmax to sigmoid, combined with a binary cross-entropy loss). As the organizers provide 20 aspect categories, Subtask C1 includes 20 labels, whereas Subtask C2 comprises 60 labels, since each aspect category can be combined with each of the three sentiments. Consistent with Lee et al. (2017) and Mishra et al. (2017), we do not account for multiple mentions of the same label in one document. The results for Subtask C1 are shown in Table 9.

Language model | test_syn | test_dia
Best model 2017 (Ruppert et al., 2017) | 0.537 | 0.556
bert-base-german-cased | |
bert-base-german-dbmdz-cased | |
bert-base-german-dbmdz-uncased | 0.761 | 0.791
bert-base-multilingual-cased | |
bert-base-multilingual-uncased | |
distilbert-base-german-cased | |
distilbert-base-multilingual-cased | |

Table 9: Micro-averaged F1 scores for Subtask C1 (Aspect-only) on synchronic and diachronic test sets. A more detailed overview of the per-class performances can be found in Table 15 in Appendix A.

All pre-trained German BERTs clearly surpass the best performance from 2017 as well as the results reported by Schmitt et al. (2018), who are the only other authors to evaluate their models on this task. Regarding the synchronic test set, the absolute improvement ranges between 16.9 and 22.4 percentage points, while for the diachronic test data, the models outperform the previous results by 17.8–23.5 percentage points. The best model is again the uncased German BERT-BASE model by dbmdz, reaching scores of 0.761 and 0.791, respectively, followed by the two cased German BERT-BASE models. One more time, the multilingual models exhibit the poorest performances amongst the evaluated models. Next, Table 10 shows the results for Subtask C2.
Language model | test_syn | test_dia
Best model 2017 (Ruppert et al., 2017) | 0.396 | 0.424
bert-base-german-cased | |
bert-base-german-dbmdz-cased | |
bert-base-german-dbmdz-uncased | 0.655 | 0.689
bert-base-multilingual-cased | |
bert-base-multilingual-uncased | |
distilbert-base-german-cased | |
distilbert-base-multilingual-cased | |

Table 10: Micro-averaged F1 scores for Subtask C2 (Aspect+Sentiment) on synchronic and diachronic test sets. A more detailed overview of the per-class performances can be found in Table 16 in Appendix A.
Here, the pre-trained models surpass the best model from 2017 by 15.7–25.9 percentage points and 20.7–26.5 percentage points, respectively, for the synchronic and diachronic test sets. Again, the best model is the uncased German BERT-BASE dbmdz model, reaching scores of 0.655 and 0.689, respectively. The CNN models (Schmitt et al., 2018) are also outperformed. For both Subtask C1 and C2, all the displayed models perform better on the diachronic than on the synchronic test data.
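Since several of the 20 (or 60) labels can be active for one document, the classification head scores every label independently with a sigmoid instead of a softmax, as noted above. A minimal sketch with made-up logits for three labels (pure Python, not the actual model code):

```python
import math

def sigmoid(z):
    """Squash a logit into (0, 1), independently per label."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(target, prob, eps=1e-12):
    """Per-label binary cross-entropy between a 0/1 target and a probability."""
    return -(target * math.log(prob + eps)
             + (1 - target) * math.log(1 - prob + eps))

# Hypothetical logits for three of the aspect labels; each label is
# thresholded independently, so multiple aspects can fire per document.
logits = {"Allgemein": 2.1, "Zugfahrt": 0.4, "Ticketkauf": -1.7}
targets = {"Allgemein": 1, "Zugfahrt": 1, "Ticketkauf": 0}

probs = {label: sigmoid(z) for label, z in logits.items()}
predicted = sorted(label for label, p in probs.items() if p >= 0.5)
loss = sum(binary_cross_entropy(targets[l], probs[l]) for l in logits) / len(logits)

print(predicted)  # ['Allgemein', 'Zugfahrt']
```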
Subtask D
Subtask D refers to the Opinion Target Extraction (OTE) and is thus a token-level classification task. As this is a rather difficult task, Wojatzki et al. (2017) distinguish between exact (Subtask D1) and overlapping match (Subtask D2), tolerating a deviation of +/- one token. Here, "entities" are identified by their BIO tags. It is noteworthy that there are fewer entities here than for Subtask C, since document-level aspects or sentiments could not always be assigned to a certain sequence in the document. As a result, fewer documents are at disposal for this task, namely 9,193. The remaining data has 1.86 opinions per document on average. The majority class is now Sonstige Unregelmäßigkeiten:negative with around 15.4% of the true entities (16,650 in total), leading to more balanced data than in Subtask C.
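The BIO scheme mentioned above can be sketched as follows: given token character offsets and an annotated target span (as provided in the .xml files), every token inside the span receives a B- or I- tag carrying the aspect:polarity label. Here the "fährt nicht" example from the data section, with illustrative offsets:

```python
def bio_tags(token_offsets, target_spans):
    """Convert character-level target spans into token-level BIO tags.
    token_offsets: list of (start, end) per token;
    target_spans: list of (start, end, label) annotations."""
    tags = ["O"] * len(token_offsets)
    for span_start, span_end, label in target_spans:
        inside = False
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start >= span_start and tok_end <= span_end:
                # First token of the span gets B-, subsequent tokens get I-.
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

# "fährt nicht": two tokens at character offsets (0, 5) and (6, 11),
# annotated as one Zugfahrt:negative target covering characters 0-11.
tokens = [(0, 5), (6, 11)]
spans = [(0, 11, "Zugfahrt:negative")]
print(bio_tags(tokens, spans))
# ['B-Zugfahrt:negative', 'I-Zugfahrt:negative']
```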
Language model | test_syn | test_dia
Best model 2017 (Ruppert et al., 2017) | 0.229 | 0.301

without CRF:
bert-base-german-cased | |
bert-base-german-dbmdz-cased | |
bert-base-german-dbmdz-uncased | |
bert-base-multilingual-cased | |
bert-base-multilingual-uncased | |
distilbert-base-german-cased | |
distilbert-base-multilingual-cased | |

with CRF:
bert-base-german-cased | |
bert-base-german-dbmdz-cased | |
bert-base-german-dbmdz-uncased | 0.515 | 0.518
bert-base-multilingual-cased | |
bert-base-multilingual-uncased | |
distilbert-base-german-cased | |
distilbert-base-multilingual-cased | |

Table 11: Entity-level micro-averaged F1 scores for Subtask D1 (exact match) on synchronic and diachronic test sets. A more detailed overview of the per-class performances can be found in Table 17 in Appendix B.
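The difference between the D1 and D2 scoring can be sketched as a comparison of predicted and gold entity spans; the ±1-token tolerance is the only change. The token indices below are illustrative:

```python
def exact_match(gold_entity, pred_entity):
    """Subtask D1: label and both span boundaries must match exactly."""
    return gold_entity == pred_entity

def overlapping_match(gold_entity, pred_entity, tolerance=1):
    """Subtask D2: span boundaries may deviate by up to `tolerance` tokens."""
    g_start, g_end, g_label = gold_entity
    p_start, p_end, p_label = pred_entity
    return (g_label == p_label
            and abs(g_start - p_start) <= tolerance
            and abs(g_end - p_end) <= tolerance)

# Gold target spans tokens 4-6; the prediction starts one token late.
gold = (4, 6, "Zugfahrt:negative")
pred = (5, 6, "Zugfahrt:negative")
print(exact_match(gold, pred), overlapping_match(gold, pred))  # False True
```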
In Table 11, we compare the pre-trained models using an "ordinary" softmax layer to the same models with an additional CRF layer for Subtask D1. The best performing model is the uncased German BERT-BASE model by dbmdz with CRF layer on both test sets, with scores of 0.515 and 0.518, respectively. Overall, the results from 2017 are outperformed by 11.8–28.6 percentage points on the synchronic test set and 5.6–21.7 percentage points on the diachronic test set.

For the overlapping match (cf. Tab. 12), the best systems from 2017 are outperformed by 4.9–17.5 percentage points on the synchronic and by 4.2–16.8 percentage points on the diachronic test set. Again, the uncased German BERT-BASE model by dbmdz with CRF layer performs best, with a micro F1 score of 0.523 on the synchronic and 0.533 on the diachronic set. To our knowledge, there were no other models to compare our performance values with, besides the results from 2017.

Language model | test_syn | test_dia
Best models 2017 (test_syn: Lee et al., 2017; test_dia: Ruppert et al., 2017) | 0.348 | 0.365

without CRF:
bert-base-german-cased | |
bert-base-german-dbmdz-cased | |
bert-base-german-dbmdz-uncased | |
bert-base-multilingual-cased | |
bert-base-multilingual-uncased | |
distilbert-base-german-cased | |
distilbert-base-multilingual-cased | |

with CRF:
bert-base-german-cased | |
bert-base-german-dbmdz-cased | |
bert-base-german-dbmdz-uncased | 0.523 | 0.533
bert-base-multilingual-cased | |
bert-base-multilingual-uncased | |
distilbert-base-german-cased | |
distilbert-base-multilingual-cased | |

Table 12: Entity-level micro-averaged F1 scores for Subtask D2 (overlapping match) on synchronic and diachronic test sets. A more detailed overview of the per-class performances can be found in Table 18 in Appendix B.

Comparison to the English Language

The most widely adopted data sets for English Aspect-based Sentiment Analysis originate from the SemEval Shared Tasks (Pontiki et al., 2014, 2015, 2016). When looking at public leaderboards, e.g. https://paperswithcode.com/, Subtask SB2 (aspect term polarity) from SemEval-2014 is the task which attracts most of the researchers. A comparison of pre-BERT and BERT-based methods reveals no big "jump" in the performance values, but rather a steady increase over time (cf. Tab. 13).

Language model | Laptops | Restaurants
Best model SemEval-2014 (Pontiki et al., 2014) | 0.7048 | 0.8095
MemNet (Tang et al., 2016) | 0.7221 | 0.8095
HAPN (Li et al., 2018) | 0.7727 | 0.8223
BERT-SPC (Song et al., 2019) | 0.7899 | 0.8446
BERT-ADA (Rietzler et al., 2020) | 0.8023 | 0.8789
LCF-ATEPC (Yang et al., 2019) | 0.8229 | 0.9018

Table 13: Development of the SOTA accuracy for the aspect term polarity task (SemEval-2014; Pontiki et al., 2014). The first three rows are pre-BERT models, the last three are BERT-based. Selected models were picked from https://paperswithcode.com/sota/aspect-based-sentiment-analysis-on-semeval.

More related to Subtasks C1 and C2 from GermEval17, but unfortunately also less used, are the subtasks SB3 (aspect category extraction) and SB4 (aspect category polarity) from SemEval-2014. Since the data sets (Restaurants and Laptops) have been further developed for SemEval-2015 and SemEval-2016, subtasks SB3 and SB4 are revisited under the names Slot 1 and Slot 3 for the in-domain ABSA in SemEval-2015. Slot 2 from SemEval-2015 aims at OTE and thus corresponds to Subtask D from GermEval17. For SemEval-2016, the same task names as in 2015 were used, subdivided into Subtask 1 (sentence-level ABSA) and Subtask 2 (text-level ABSA). Table 14 shows the performance of the best model from 2014 as well as performance values for subsequent (pre-BERT and BERT-based) architectures for subtasks SB3 and SB4 (Restaurants domain).

Language model | SB3 | SB4
Best model SemEval-2014 (Pontiki et al., 2014) | 0.8857 | 0.8292
ATAE-LSTM (Wang et al., 2016) | – | 0.840
BERT-pair (Sun et al., 2019) | 0.9218 | 0.899
CG-BERT† (Wu and Ong, 2020) | 0.9162 | |
QACG-BERT† (Wu and Ong, 2020) | 0.9264 | 0.904

Table 14: Development of the SOTA F1 score (SB3) and accuracy (SB4) for the aspect category extraction/polarity task (SemEval-2014; Pontiki et al., 2014). The first two rows are pre-BERT models, the last three are BERT-based. † Wu and Ong (2020) additionally used auxiliary sentences here.
In contrast to what we observed for subtask SB2, in this case the performance increase on SB4 caused by the introduction of BERT seems rather striking. While the ATAE-LSTM (Wang et al., 2016) only slightly increased the performance compared to 2014, the BERT-based models led to a jump of more than 6 percentage points.

Nevertheless, the comparability to our experiments on the GermEval17 data is somewhat limited. Subtask SB4 only exhibits five aspect categories (as opposed to 20 categories for GermEval17), which leads to an easier classification problem and is reflected in the already pretty high scores of the 2014 baselines. Another issue is that while for improving the SOTA on the SemEval-2014 tasks (partly) highly specialized (T)ABSA architectures were used, we "only" applied standard pre-trained German BERT models without any task-specific modifications or extensions. This leaves room for further improvements on this task on German data, which should be an objective for future research.
Conclusion

As expected, all the pre-trained language models clearly outperform all the models from 2017, proving once more the power of transfer learning. Throughout the presented analyses, the models always achieve similar results on the synchronic and the diachronic test sets, indicating temporal robustness of the models. Nonetheless, the diachronic data was collected only half a year after the main data. It would be interesting to see whether the trained models would return similar predictions on data collected a couple of years later.

The uncased German BERT-BASE model by dbmdz achieves the best results across all subtasks. Since Rönnqvist et al. (2019) showed that monolingual BERT models often outperform the multilingual models on a variety of tasks, one might have already suspected that a monolingual German BERT performs best across the performed tasks. It may not seem evident at first that an uncased language model ends up as the best performing model since, e.g. in Sentiment Analysis, capitalized letters might be an indicator for polarity. In addition, since nouns and beginnings of sentences always start with a capital letter in German, one might assume that lower-casing the whole text changes the meaning of some words and thus confuses the language model. Nevertheless, the GermEval17 documents are very noisy since they were retrieved from social media. That means the data contains many misspellings, grammar and expression mistakes, dialect, and colloquial language. For this reason, some participating teams in 2017 already pursued elaborate pre-processing of the text data in order to eliminate some of this noise (Hövelmann and Friedrich, 2017; Sayyed et al., 2017; Sidarenka, 2017). Among other things, Hövelmann and Friedrich (2017) transformed the text to lower-case and replaced, for example, "S-Bahn" and "S Bahn" with "sbahn". We suppose that in this case, lower-casing the texts improves the data quality by eliminating some of the noise and acts as a sort of regularization. As a result, the uncased models potentially generalize better than the cased models. The findings of Mayhew et al. (2019), who compare cased and uncased pre-trained models on social media data for NER, corroborate this hypothesis.
References
Attia, M., Samih, Y., Elkahky, A., and Kallmeyer, L. (2018). Multilingual multi-class sentiment classification using convolutional neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Barriere, V. and Balahur, A. (2020). Improving sentiment analysis over non-English tweets using multilingual transformers and automatic translation for data-augmentation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 266–271, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Behdenna, S., Barigou, F., and Belalem, G. (2018). Document level sentiment analysis: A survey. EAI Endorsed Transactions on Context-aware Systems and Applications, 4:154339.

Biesialska, K., Biesialska, M., and Rybinski, H. (2020). Sentiment analysis with contextual embeddings and self-attention. arXiv preprint arXiv:2003.05574.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Cieliebak, M., Deriu, J. M., Egger, D., and Uzdilli, F. (2017). A Twitter corpus and benchmark resources for German sentiment analysis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 45–51, Valencia, Spain. Association for Computational Linguistics.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Esplà-Gomis, M., Forcada, M., Ramírez-Sánchez, G., and Hoang, H. T. (2019). ParaCrawl: Web-scale parallel corpora for the languages of the EU. In MT-Summit.

Guhr, O., Schumann, A.-K., Bahrmann, F., and Böhme, H.-J. (2020). Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 1627–1632, Marseille, France.

Haddow, B. (2018). News Crawl Corpus.

Hoang, M., Bihorac, O. A., and Rouces, J. (2019). Aspect-based sentiment analysis using BERT. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 187–196, Turku, Finland. Linköping University Electronic Press.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.

Hövelmann, L. and Friedrich, C. M. (2017). Fasttext and Gradient Boosted Trees at GermEval-2017 Tasks on Relevance Classification and Document-level Polarity. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Karimi, A., Rossi, L., and Prati, A. (2020). Adversarial training for aspect-based sentiment analysis with BERT.

Lee, J.-U., Eger, S., Daxenberger, J., and Gurevych, I. (2017). UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Li, L., Liu, Y., and Zhou, A. (2018). Hierarchical attention based position-aware network for aspect-level sentiment analysis. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 181–189, Brussels, Belgium. Association for Computational Linguistics.

Li, X., Bing, L., Zhang, W., and Lam, W. (2019). Exploiting BERT for end-to-end aspect-based sentiment analysis. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 34–41, Hong Kong, China. Association for Computational Linguistics.

Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Mayhew, S., Tsygankova, T., and Roth, D. (2019). ner and pos when nothing is capitalized. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6256–6261, Hong Kong, China. Association for Computational Linguistics.

McCann, B., Bradbury, J., Xiong, C., and Socher, R. (2017). Learned in translation: Contextualized word vectors. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30, pages 6294–6305. Curran Associates, Inc.

Mishra, P., Mujadia, V., and Lanka, S. (2017). GermEval 2017: Sequence based Models for Customer Feedback Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In Bański, P., Barbaresi, A., Biber, H., Breiteneder, E., Clematide, S., Kupietz, M., Lüngen, H., and Iliadi, C., editors, Cardiff, United Kingdom. Leibniz-Institut für Deutsche Sprache.

Ostendorff, M., Blume, T., and Ostendorff, S. (2020). Towards an Open Platform for Legal Information. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, JCDL '20, pages 385–388, New York, NY, USA. Association for Computing Machinery.

Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Pontiki, M., Galanis, D., Papageorgiou, H., Androutsopoulos, I., Manandhar, S., AL-Smadi, M., Al-Ayyoub, M., Zhao, Y., Qin, B., De Clercq, O., Hoste, V., Apidianaki, M., Tannier, X., Loukachevitch, N., Kotelnikov, E., Bel, N., Zafra, S. M., and Eryiğit, G. (2016). SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30.

Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., and Androutsopoulos, I. (2015). SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, Colorado. Association for Computational Linguistics.

Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, H., Androutsopoulos, I., and Manandhar, S. (2014). SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Rietzler, A., Stabinger, S., Opitz, P., and Engl, S. (2020). Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4933–4941, Marseille, France. European Language Resources Association.

Rönnqvist, S., Kanerva, J., Salakoski, T., and Ginter, F. (2019). Is Multilingual BERT Fluent in Language Generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 29–36, Turku, Finland. Linköping University Electronic Press.

Ruppert, E., Kumar, A., and Biemann, C. (2017). LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Sayyed, Z. A., Dakota, D., and Kübler, S. (2017). IDS-IUCL: Investigating Feature Selection and Oversampling for GermEval 2017. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Schmitt, M., Steinheber, S., Schreiber, K., and Roth, B. (2018). Joint aspect and polarity classification for aspect-based sentiment analysis with end-to-end neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1109–1114, Brussels, Belgium. Association for Computational Linguistics.

Sidarenka, U. (2017). PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Skadiņš, R., Tiedemann, J., Rozis, R., and Deksne, D. (2014). Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 1850–1855, Reykjavik, Iceland. European Language Resources Association (ELRA).

Song, Y., Wang, J., Jiang, T., Liu, Z., and Rao, Y. (2019). Attentional encoder network for targeted sentiment classification. arXiv preprint arXiv:1902.09314.

Sun, C., Huang, L., and Qiu, X. (2019). Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 380–385, Minneapolis, Minnesota. Association for Computational Linguistics.

Tang, D., Qin, B., and Liu, T. (2016). Aspect level sentiment classification with deep memory network. arXiv preprint arXiv:1605.08900.

Tao, J. and Fang, X. (2020). Toward multi-label sentiment analysis: a transfer learning based approach. Journal of Big Data, 7:1.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems, Long Beach, California, USA.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2019). SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3266–3280.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Wang, Y., Huang, M., Zhu, X., and Zhao, L. (2016). Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Wojatzki, M., Ruppert, E., Holschneider, S., Zesch, T., and Biemann, C. (2017). GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 1–12, Berlin, Germany.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2020). Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Wu, Z. and Ong, D. C. (2020). Context-guided BERT for targeted aspect-based sentiment analysis. arXiv preprint arXiv:2010.07523.

Xu, H., Liu, B., Shu, L., and Yu, P. S. (2019). BERT post-training for review reading comprehension and aspect-based sentiment analysis.

Yang, H., Zeng, B., Yang, J., Song, Y., and Xu, R. (2019). A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction. arXiv preprint arXiv:1912.07976.

Appendix

A Detailed results (per category) for Subtask C
Because of the high number of classes and their skewed distribution, it is worthwhile to take a more detailed look at the model performance for this subtask on category level. Table 15 shows the performance of the uncased German BERT-BASE model by dbmdz per test set for Subtask C1. The support indicates the number of appearances of each category, which in this case is also displayed in Table 4. Seven categories are summarized in Rest because they have an F1 score of 0 on both test sets, i.e. the model is not able to correctly identify any of these seven aspects appearing in the test data. The table is sorted by the score on the synchronic test set.

Aspect Category                      syn Score  syn Support  dia Score  dia Support
Allgemein                                0.854        1,398      0.877        1,024
Sonstige Unregelmäßigkeiten              0.782          224      0.785          164
Connectivity                             0.750           36      0.838           73
Zugfahrt                                 0.678          241      0.687          184
Auslastung und Platzangebot              0.645           35      0.667           20
Sicherheit                               0.602           84      0.639           42
Atmosphäre                               0.600          148      0.532           53
Barrierefreiheit                         0.500            9      0                2
Ticketkauf                               0.481           95      0.506           48
Service und Kundenbetreuung              0.476           63      0.417           27
DB App und Website                       0.455           28      0.563           18
Informationen                            0.329           58      0.464           35
Komfort und Ausstattung                  0.286           24      0               11
Rest                                     0               24      0               20

Table 15: Micro-averaged F1 scores and support by aspect category (Subtask C1). The seven categories summarized in Rest each show a score of 0.
The F1 scores for Allgemein (General), Sonstige Unregelmäßigkeiten (Other irregularities) and Connectivity are the highest. 13 categories, mostly the same on both test sets, show a positive F1 score on at least one of the two test sets. For the categories subsumed under Rest, the model was not able to learn how to identify them correctly. Subtask C2 exhibits a similar distribution of the true labels, with the Aspect+Sentiment category Allgemein:neutral as the majority class: over 50% of the true labels belong to this class. Table 16 shows that only 12 out of 60 labels can be detected by the model.

Aspect+Sentiment Category                syn Score  syn Support  dia Score  dia Support
Allgemein:neutral                            0.804        1,108      0.832          913
Sonstige Unregelmäßigkeiten:negative         0.782          221      0.793          159
Zugfahrt:negative                            0.645          197      0.725          149
Sicherheit:negative                          0.640           78      0.585           39
Allgemein:negative                           0.582          258      0.333           80
Atmosphäre:negative                          0.569          126      0.447           39
Connectivity:negative                        0.400           20      0.291           46
Ticketkauf:negative                          0.364           42      0.298           34
Auslastung und Platzangebot:negative         0.350           31      0.211           17
Allgemein:positive                           0.214           41      0.690           33
Zugfahrt:positive                            0.154           34      0               34
Service und Kundenbetreuung:negative         0.146           36      0.174           21
Rest                                         0              343      0              180

Table 16: Micro-averaged F1 scores and support by Aspect+Sentiment category (Subtask C2). The 48 categories summarized in Rest each show a score of 0.
All the aspect categories displayed in Table 16 also appear in Table 15, and most of them carry negative sentiment. Allgemein:neutral and Sonstige Unregelmäßigkeiten:negative show the highest scores. Again, we assume that the remaining 48 categories could not be identified due to data sparsity. Keeping this in mind, the model nevertheless achieves a relatively high overall performance on both Subtask C1 and C2 (cf. Tab. 9 and Tab. 10). This is mainly owed to the high scores of the majority classes Allgemein and Allgemein:neutral, respectively, because the micro F1 score puts a lot of weight on majority classes. It might be interesting to investigate whether the classification of the rare categories can be improved by balancing the data. We experimented with removing general categories such as Allgemein and Allgemein:neutral, or documents with sentiment neutral, since these are usually less interesting for a company. We observed a large drop in the overall F1 score, which we attribute to the absence of the strong majority class and the resulting data loss. Indeed, the classification of some individual categories improved, but the rare categories still could not be identified by the model.
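The weight that micro-averaging puts on majority classes can be illustrated with a small sketch. The per-class (TP, FP, FN) counts below are invented for illustration and are not taken from our experiments:

```python
def micro_f1(stats):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    tp = sum(s[0] for s in stats.values())
    fp = sum(s[1] for s in stats.values())
    fn = sum(s[2] for s in stats.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (TP, FP, FN) per class: one well-classified dominant class,
# two rare classes that are never predicted correctly.
stats = {
    "majority": (900, 50, 50),
    "rare_1":   (0, 0, 20),
    "rare_2":   (0, 0, 15),
}
print(round(micro_f1(stats), 3))  # ≈ 0.93 although both rare classes score 0
```

In this toy setting the macro average over the three per-class F1 scores would only be about 0.32, which shows how strongly the micro score is driven by a class like Allgemein:neutral.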
B Detailed results (per category) for Subtask D
As for Subtask C, the results of the best model are investigated in more detail. Table 17 gives the detailed classification report of the uncased German BERT-BASE model with CRF layer on Subtask D1. Only entities that were correctly detected at least once are displayed; the table is sorted by the score on the synchronic test set. The classification report for Subtask D2 is displayed analogously in Table 18.

Category                                 syn Score  syn Support  dia Score  dia Support
Zugfahrt:negative                            0.702          622      0.729          495
Sonstige Unregelmäßigkeiten:negative         0.681          693      0.581          484
Sicherheit:negative                          0.604          337      0.457          122
Connectivity:negative                        0.598           56      0.620          109
Barrierefreiheit:negative                    0.595           14      0                3
Auslastung und Platzangebot:negative         0.579           66      0.447           31
Connectivity:positive                        0.571           26      0.555           60
Allgemein:negative                           0.545          807      0.343          139
Atmosphäre:negative                          0.500          403      0.337          164
Ticketkauf:negative                          0.383           96      0.583           74
Ticketkauf:positive                          0.368           59      0               13
Komfort und Ausstattung:negative             0.357           24      0               16
Atmosphäre:neutral                           0.348           40      0.111           14
Service und Kundenbetreuung:negative         0.323           74      0.286           31
Informationen:negative                       0.301           68      0.505           46
Zugfahrt:positive                            0.276           62      0.343           83
DB App und Website:negative                  0.232           39      0.375           33
DB App und Website:neutral                   0.188           23      0               11
Sonstige Unregelmäßigkeiten:neutral          0.179           13      0.222            2
Allgemein:positive                           0.157           86      0.586           92
Service und Kundenbetreuung:positive         0.115           23      0                5
Atmosphäre:positive                          0.105           26      0               15
Ticketkauf:neutral                           0.040          144      0.222           25
Connectivity:neutral                         0               11      0.211           15
Toiletten:negative                           0               15      0.160           23
Rest                                         0              355      0              115

Table 17: Micro-averaged F1 scores and support by Aspect+Sentiment entity with exact match (Subtask D1). The 35 categories summarized in Rest each exhibit a score of 0.
For Subtask D1, the model returns a positive score for 25 entity categories on at least one of the two test sets. The category Zugfahrt:negative is classified best on both test sets, followed by Sonstige Unregelmäßigkeiten:negative and Sicherheit:negative on the synchronic test set and by Connectivity:negative and Allgemein:positive on the diachronic set. Visibly, the scores differ more between the two test sets here than in the classification report of the previous task. The report for the overlapping match (cf. Tab. 18) shows slightly better results on some categories than the one for the exact match. The third-best score on the diachronic test data is now Sonstige Unregelmäßigkeiten:negative.

Category                                 syn Score  syn Support  dia Score  dia Support
Zugfahrt:negative                            0.708          622      0.739          495
Sonstige Unregelmäßigkeiten:negative         0.697          693      0.617          484
Sicherheit:negative                          0.607          337      0.475          122
Connectivity:negative                        0.598           56      0.620          109
Barrierefreiheit:negative                    0.595           14      0                3
Auslastung und Platzangebot:negative         0.579           66      0.447           31
Connectivity:positive                        0.571           26      0.555           60
Allgemein:negative                           0.561          807      0.363          139
Atmosphäre:negative                          0.505          403      0.358          164
Ticketkauf:negative                          0.383           96      0.583           74
Ticketkauf:positive                          0.368           59      0               13
Komfort und Ausstattung:negative             0.357           24      0               16
Atmosphäre:neutral                           0.348           40      0.111           14
Service und Kundenbetreuung:negative         0.323           74      0.286           31
Informationen:negative                       0.301           68      0.505           46
Zugfahrt:positive                            0.276           62      0.343           83
DB App und Website:negative                  0.261           39      0.406           33
DB App und Website:neutral                   0.188           23      0               11
Sonstige Unregelmäßigkeiten:neutral          0.179           13      0.222            2
Allgemein:positive                           0.157           86      0.586           92
Service und Kundenbetreuung:positive         0.115           23      0                5
Atmosphäre:positive                          0.105           26      0               15
Ticketkauf:neutral                           0.040          144      0.222           25
Connectivity:neutral                         0               11      0.211           15
Toiletten:negative                           0               15      0.160           23
Rest                                         0              355      0              112

Table 18: Micro-averaged F1 scores and support by Aspect+Sentiment entity with overlapping match (Subtask D2). The 35 categories summarized in Rest each show a score of 0.