Learning From Revisions: Quality Assessment of Claims in Argumentation at Scale
Gabriella Skitalinskaya, Jonas Klaff
Department of Computer Science, University of Bremen, Bremen, Germany
{gabski, joklaff}@uni-bremen.de
Henning Wachsmuth
Department of Computer Science, Paderborn University, Paderborn, Germany
[email protected]
Abstract
Assessing the quality of arguments and of the claims the arguments are composed of has become a key task in computational argumentation. However, even if different claims share the same stance on the same topic, their assessment depends on the prior perception and weighting of the different aspects of the topic being discussed. This renders it difficult to learn topic-independent quality indicators. In this paper, we study claim quality assessment irrespective of discussed aspects by comparing different revisions of the same claim. We compile a large-scale corpus with over 377k claim revision pairs of various types from kialo.com, covering diverse topics from politics, ethics, entertainment, and others. We then propose two tasks: (a) assessing which claim of a revision pair is better, and (b) ranking all versions of a claim by quality. Our first experiments with embedding-based logistic regression and transformer-based neural networks show promising results, suggesting that learned indicators generalize well across topics. In a detailed error analysis, we give insights into what quality dimensions of claims can be assessed reliably. We provide the data and scripts needed to reproduce all results.

Data and code: https://github.com/GabriellaSky/claimrev

Assessing argument quality is as important as it is questionable in nature. On the one hand, identifying the good and the bad claims and reasons for arguing on a given topic is key to convincingly supporting or attacking a stance in debating technologies (Rinott et al., 2015), argument search (Ajjour et al., 2019), and similar applications. On the other hand, argument quality can be considered on different granularity levels and from diverse perspectives, many of which are inherently subjective (Wachsmuth et al., 2017a); they depend on the prior beliefs and stance on a topic as well as on the personal weighting of different aspects of the topic (Kock, 2007).

Existing research largely ignores this limitation by focusing on learning to predict argument quality based on subjective assessments of human annotators (see Section 2 for examples). In contrast, Habernal and Gurevych (2016) control for topic and stance to compare the convincingness of arguments. Wachsmuth et al. (2017b) abstract from an argument's text, assessing its relevance only structurally. Lukin et al. (2017) and El Baff et al. (2020) focus on personality-specific and ideology-specific quality perception, respectively, whereas Toledo et al. (2019a) asked annotators to disregard their own stance in judging length-restricted arguments. However, none of these approaches controls for the concrete aspects of a topic that the arguments claim and reason about. This renders it difficult to learn what makes an argument and its building blocks good or bad in general.

In this paper, we study quality in argumentation irrespective of the discussed topics, aspects, and stances by assessing different revisions of the basic building blocks of arguments, i.e., claims. Such revisions are found in large quantities on online debate platforms such as kialo.com, where users post claims, other users suggest revisions to improve claim quality (in terms of clarity, grammaticality, grounding, etc.), and moderators approve or disapprove them.
By comparing the quality of different revisions of the same instance, we argue that we can learn general quality characteristics of argumentative text and, to a wide extent, abstract from prior perceptions and weightings.

To address the proposed problem, we present a new large-scale corpus, consisting of 124k unique claims from kialo.com spanning a diverse range of topics related to politics, ethics, and several others (Section 3). Using distant supervision, we derive a total of 377k claim revision pairs from the platform, each reflecting a quality improvement, often with a specified revision type. Four examples are shown in Table 1. To the best of our knowledge, this is the first corpus to target quality assessment based on claim revisions. In a manual annotation study, we provide support for our underlying hypothesis that a revision improves a claim in most cases, and we test how much the revision types correlate with known argument quality dimensions.

Claim before Revision | Claim after Revision | Revision Type
Dogs can help disabled people function better. | Dogs can help disabled people to navigate the world better. | Claim Clarification
African American soldiers joined unionists to fight for their freedom. | Black soldiers joined unionists to fight for their freedom. | Typo/Grammar Correction
Elections insure the independence of the judiciary. | Elections ensure the independence of the judiciary. | Typo/Grammar Correction
Israel has a track record of selling US arms to third countries without authorization. | Israel has a track record of selling US arms to third countries without authorization ( ). | Corrected/Added Links

Table 1: Four examples of claims from Kialo before and after revision, along with the type of revision performed.

Given the corpus, we study two tasks: (a) how to compare revisions of a claim by quality and (b) how to rank a set of claim revisions. As initial approaches to the first task, we select in Section 4 a "traditional" logistic regression model based on word embeddings as well as transformer-based neural networks (Vaswani et al., 2017), such as BERT (Devlin et al., 2019) and SBERT (Reimers and Gurevych, 2019). For the ranking task, we consider the Bradley-Terry-Luce model (Bradley and Terry, 1952; Luce, 2012) and SVMRank (Joachims, 2006). They achieve promising results, indicating that the compiled corpus allows learning topic-independent characteristics associated with the quality of claims (Section 5). To understand which claim quality improvements can be assessed reliably, we then carry out a detailed error analysis for different revision types and numbers of revisions.

The main contributions of our work are: (1) a new corpus for topic-independent claim quality assessment, with distantly supervised quality improvement labels of claim revision pairs, (2) initial promising approaches to the tasks of claim quality classification and ranking, and (3) insights into what works well in claim quality assessment and what remains to be solved.
In recent years, there has been an increase in research on the quality of arguments and the claims and reasoning they are composed of. Wachsmuth et al. (2017a) describe argumentation quality as a multidimensional concept that can be considered from logical, rhetorical, and dialectical perspectives. To achieve a common understanding, the authors suggest a unified framework with 15 quality dimensions, which together give a holistic quality evaluation at a certain abstraction level. They point out that several dimensions may be perceived differently depending on the target audience. In recent follow-up work, Wachsmuth and Werner (2020) examined how well each dimension can be assessed based on plain text only.

Most existing quality assessment approaches target a single dimension. On mixed-topic student essays, Persing and Ng (2013) learn to score the clarity of an argument's thesis, Persing and Ng (2015) do the same for argument strength, and Stab and Gurevych (2017) classify whether an argument's premises sufficiently support its conclusion. All of these are trained on pointwise quality annotations in the form of scores or binary judgments. Gretz et al. (2019) provide a corpus with crowdsourced quality annotations for 30,497 arguments, the largest to date for pointwise argument quality. The authors studied how their annotations correlate with the 15 dimensions from the framework of Wachsmuth et al. (2017a), finding that only global relevance and effectiveness are captured. Similarly, Lauscher et al. (2020) built a new corpus based on the framework to then exploit interactions between the dimensions in a neural approach. We present a small related annotation study for our dataset below.

However, we follow Habernal and Gurevych (2016) in that we cast argument quality assessment as a relation classification problem, where the goal is to identify the better among a pair of instances. In particular, Habernal and Gurevych (2016) created a dataset with argument convincingness pairs on 32 topics. To mitigate annotator bias, the arguments in a pair always have the same stance on the same topic. The more convincing argument is then predicted using a feature-rich SVM and a simple bidirectional LSTM. Other approaches to the same task map passage representations to real-valued scores using Gaussian Process Preference Learning (Simpson and Gurevych, 2018) or represent arguments by the sum of their token embeddings (Potash et al., 2017), later extended by a feed-forward neural network (Potash et al., 2019). Recently, Gleize et al. (2019) employed a Siamese neural network to rank arguments by the convincingness of evidence. In our experiments below, we take on some of these ideas, but also explore the impact of transformer-based methods such as BERT (Devlin et al., 2019), which have been shown to predict argument quality well (Gretz et al., 2019).

Potash et al. (2017) observed that longer arguments tend to be judged better in existing corpora, a phenomenon we will also check for below. Toledo et al. (2019b) prevent such bias in their corpora for both pointwise and pairwise quality by restricting the length of arguments to 8–36 words. The authors define quality as the level of preference for an argument over other arguments with the same stance, asking annotators to disregard their own stance. For a more objective assessment of argument relevance, Wachsmuth et al. (2017b) abstract from content, ranking arguments only based on structural relations, but they employ majority human assessments for evaluation. Lukin et al. (2017) take a different approach, including knowledge about the personality of the reader into the assessment, and El Baff et al. (2020) study the impact of argumentative texts on people depending on their political ideology.

As can be seen, several approaches aim to control for length, stance, audience, or similar factors. However, all of them still compare argumentative texts with different content and meaning in terms of the aspects of topics being discussed. In this work, we assess quality based on different revisions of the same text. In this setting, the quality assessment primarily concerns how a text is formulated, which will help to better understand what influences argument quality in general, irrespective of the topic. To be able to do so, we refer to online debate portals.

Debate portals give users the opportunity to discuss their views on a wide range of topics. Existing research has used the rich argumentative content and structure of different portals for argument mining, including createdebate.com (Habernal and Gurevych, 2015), idebate.org (Al-Khatib et al., 2016), and others. Also, large-scale debate portal datasets form the basis of applications such as argument search engines (Ajjour et al., 2019). Unlike these works, we exploit debate portals for studying quality. Tan et al. (2016) predicted argument persuasiveness in the discussion forum ChangeMyView from ground-truth labels given by opinion posters, and Wei et al. (2016) used user upvotes and downvotes for the same purpose. Here, we resort to kialo.com, where users can not only state argumentative claims and vote on the impact of claims submitted by others, but can also help improve claims by suggesting revisions, which are approved or disapproved by moderators. While Durmus et al. (2019) assessed quality based on the impact value of claims from kialo.com, we derive information on quality from the revision history of claims.

The only work we are aware of that analyzes revision quality of argumentative texts is the study of Afrin and Litman (2018). From the corpus of Zhang et al. (2017), containing 60 student essays with three draft versions each, 940 sentence writing revision pairs were annotated for whether or not the revision improves essay quality. The authors then trained a random forest classifier for automatic revision quality classification. In contrast, instead of sentences, we shift our focus to claims. Moreover, our dataset is orders of magnitude larger and includes notably longer revision chains, which enables deeper analyses and more reliable prediction of revision quality using data-intensive methods.
Here, we present our corpus, created based on claim revision histories collected from kialo.com. Kialo is a typical example of an online debate portal for collaborative argumentative discussions, where participants jointly develop complex pro/con debates on a variety of topics. The scope ranges from general topics (religion, fair trade, etc.) to very specific ones, for instance, on particular policy-making (e.g., whether wealthy countries should provide citizens with a universal basic income). Each debate consists of a set of claims and is associated with a list of related pre-defined generic categories, such as politics, ethics, education, and entertainment.

What differentiates Kialo from other portals is that it allows editing claims and tracking changes made in a discussion. All users can help improve existing claims by suggesting edits, which are then accepted or rejected by the moderator team of the debate. As every suggested change is discussed by the community, this collaborative process should lead to a continuous improvement of claim quality and a diverse set of claims for each topic.

As a result of the editing process, claims in a debate have a version history in the format of claim pairs, forming a chain where one claim is the successor of another and is considered to be of higher quality (examples are found in Table 1). In addition, claim pairs may have a revision type label assigned to them via a non-mandatory free-form text field, where moderators explain the reason for the revision.

Corpus | Type of Instances | Instances
ClaimRev_BASE | Total claim pairs | 210,222
ClaimRev_BASE | Claim Clarification | 63,729
ClaimRev_BASE | Typo/Grammar Correction | 59,690
ClaimRev_BASE | Corrected/Added Links | 17,882
ClaimRev_BASE | Changed Meaning of Claim | 1,178
ClaimRev_BASE | Misc | 10,464
ClaimRev_BASE | None | 57,279
ClaimRev_EXT | Total claim pairs | 377,659
ClaimRev_EXT | Revision distance 1 | 77,217
ClaimRev_EXT | Revision distance 2 | 27,819
ClaimRev_EXT | Revision distance 3 | 10,753
ClaimRev_EXT | Revision distance 4 | 4,460
ClaimRev_EXT | Revision distance 5 | 2,055
ClaimRev_EXT | Revision distance 6+ | 2,008
Both Corpora | Claim revision chains | 124,312

Table 2: Statistics of the two provided corpus versions. ClaimRev_BASE: number of claim pairs in total and of each revision type. ClaimRev_EXT: number of claim pairs in total and of each revision distance. The bottom line shows the number of unique revision chains in the corpora.
Base Corpus
To compile the corpus, we scraped all 1,628 debates found on Kialo until June 26th, 2020, related to over 1,120 categories. They contain 124,312 unique claims along with their revision histories, which comprise 210,222 pairwise relations. The average number of revisions per claim is 1.7, and the maximum length of a revision chain is 36. 74% of all pairs have a revision type. Overall, there are 8,105 unique revision type labels in the corpus. 92% of labeled claim pairs refer to three types only:
Claim Clarification, Typo/Grammar Correction, and Corrected/Added Links. An overview of the distribution of revision labels is given in Table 2. We refer to the resulting corpus as ClaimRev_BASE.

Figure 1: Visual representation of relations between revisions. Solid and dashed lines denote original and inferred non-consecutive relations, respectively.
Data pre-processing included removing all claim pairs from debates carried out in languages other than English. Also, we considered claims with fewer than four characters as uninformative and left them out. As we seek to compare different versions of the same claim, claim version pairs with a general change of meaning do not satisfy this description. Thus, we removed such pairs from the corpus, too (inspecting the data revealed that such pairs were mostly generated due to debate restructuring). For this, we assessed the cosine similarity of a given claim pair using spacy.io and removed a pair if the score was lower than a threshold of 0.8.
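For illustration, a minimal sketch of this filtering step is shown below. It assumes the medium English spaCy model (en_core_web_md), which provides word vectors; the exact model and pipeline used for the corpus may differ.

```python
# Illustrative sketch of the preprocessing filter; the spaCy model and the way
# the threshold is applied are assumptions, not the exact corpus pipeline.
import spacy

nlp = spacy.load("en_core_web_md")  # assumed model with word vectors
SIMILARITY_THRESHOLD = 0.8

def keep_pair(claim_before: str, claim_after: str) -> bool:
    """Keep a revision pair only if both versions are informative and similar enough."""
    if len(claim_before) < 4 or len(claim_after) < 4:
        return False  # claims with fewer than four characters are treated as uninformative
    similarity = nlp(claim_before).similarity(nlp(claim_after))
    return similarity >= SIMILARITY_THRESHOLD  # low similarity suggests a change of meaning

pair = ("Elections insure the independence of the judiciary.",
        "Elections ensure the independence of the judiciary.")
print(keep_pair(*pair))
```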
Extended Corpus
To increase the diversity of data available for training models, without actually collecting new data, we applied data augmentation. ClaimRev_BASE consists of consecutive claim version pairs, i.e., if a claim has four versions, it will be represented by three pairs: (v1, v2), (v2, v3), and (v3, v4), where v1 is the original claim and v4 is the latest version. We extend this data by adding all pairs between non-consecutive versions that are inferable transitively. Considering the previous example, this means we add (v1, v3), (v1, v4), and (v2, v4). This is based on our hypothesis that every argument version is of higher quality than its predecessors, which we come back to below. Figure 1 illustrates the data augmentation; a code sketch of the pair generation follows below. We call the augmented corpus ClaimRev_EXT.
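As a rough illustration (not the actual corpus-construction script; the helper names are hypothetical), the two pair sets can be derived from a revision chain as follows:

```python
# Hypothetical helpers illustrating how ClaimRev_BASE (consecutive pairs) and
# ClaimRev_EXT (all transitively inferable pairs) relate to a revision chain.
from itertools import combinations

def base_pairs(chain):
    """Consecutive pairs (v1, v2), (v2, v3), ..., as in ClaimRev_BASE."""
    return list(zip(chain, chain[1:]))

def extended_pairs(chain):
    """All ordered pairs (earlier, later) with their revision distance, as in ClaimRev_EXT."""
    return [(chain[i], chain[j], j - i) for i, j in combinations(range(len(chain)), 2)]

chain = ["v1", "v2", "v3", "v4"]
print(base_pairs(chain))      # 3 consecutive pairs
print(extended_pairs(chain))  # 6 pairs with revision distances 1 to 3
```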
For this corpus, we introduce the concept of revision distance, by which we mean the number of revisions between two versions. For example, the distance between v1 and v2 would be 1, whereas the distance between v1 and v3 would be 2. The distribution of the revision distances across ClaimRev_EXT is summarized in Table 2.

The number of claim pairs in each of the 20 most frequent categories in both corpus versions is presented in Figure 2. We will restrict our view to the topics in these categories in our experiments.
Figure 2: Number of claim revision pairs in each debate category of the two provided versions of our corpus (ClaimRev_BASE, ClaimRev_EXT). The 20 categories shown are Politics, Ethics, Society, USA, Law, Religion, Health, Philosophy, Government, Economics, Gender, Equality, Science, Technology, Europe, Children, Justice, Democracy, Education, and Climate change.

While collaborative content creation enables leveraging the wisdom of large groups of individuals toward solving problems, it also poses challenges in terms of quality control, because it relies on varying perceptions of quality, backgrounds, expertise, and personal objectives of the moderators. To assess the consistency of the distantly supervised corpus annotations, we carried out two annotation studies on samples of our corpus.
Consistency of Relative Quality
In this study, we aimed to capture the general perception of claim quality on a meta-level by deriving a data-driven quality assessment based on the revision histories. This was based on our hypothesis that every claim version is better than its predecessor. To test the validity of this hypothesis, two authors of this paper annotated whether a revision increases, decreases, or does not affect the overall claim quality. For this purpose, we randomly sampled 315 claim revision pairs, which can be found in the supplementary material.

The results clearly support our hypothesis, showing an increase in quality in 292 (93%) of the annotated cases at a Cohen's κ agreement of 0.75, while 8 (3%) of the revisions had no effect on quality and only 6 (2%) led to a decrease. On the remaining 2%, the annotators did not reach an agreement.

Consistency of Revision Type Labels
Our second annotation study focused on the reliability of the revision type labels. We restricted our view to the top three revision labels, which cover 96% of all revisions. We randomly sampled 140–150 claim pairs per revision type, 440 in total. For each claim pair, the same annotators as above provided a label for the revision type from the following set: Claim Clarification, Typo/Grammar Correction, Corrected/Added Links, and Other. Comparing the results to the original labels in the corpus revealed that the two annotators strongly agreed with the labels, namely with a Cohen's κ of 0.82 and 0.76, respectively. The level of agreement between the annotators was even higher (κ = 0.84). In further analysis, we observed that most confusion happened between the revision types Typo/Grammar Correction and Claim Clarification. This may be due to the non-strict nature of the revision type labels, which leaves space for different interpretations on a case-by-case basis. Still, we conclude that the revision type labels seem reliable in general.
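For reference, agreement values of this kind can be computed as follows; the labels below are invented and only illustrate the computation (assuming scikit-learn), not the study data.

```python
# Illustrative agreement computation with made-up annotator labels.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Clarification", "Grammar", "Links", "Other", "Grammar"]
annotator_b = ["Clarification", "Clarification", "Links", "Other", "Grammar"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```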
To explore the relationship between the revision types on Kialo and argument quality in general, we conducted a third annotation study. In particular, for each of the 315 claim pairs from Section 3.2, one of the authors of this paper provided a label indicating whether or not the revision improved the claim with respect to each of the 15 quality dimensions defined by Wachsmuth et al. (2017a). It should be noted that the annotators had reached an agreement on the revision type for all these pairs.

Quality Dimension | Clarification | Grammar | Links
Cogency | -0.31 | -0.31 | 0.65
Local Acceptability | — | -0.20 | -0.19
Local Relevance | — | -0.25 | -0.22
Local Sufficiency | -0.28 | -0.33 | 0.62
Effectiveness | -0.35 | 0.34 | —
Credibility | 0.06 | -0.16 | 0.10
Emotional Appeal | 0.00 | 0.00 | 0.00
Clarity | -0.16 | 0.35 | -0.18
Appropriateness | 0.01 | 0.02 | -0.04
Arrangement | 0.00 | 0.00 | 0.00
Reasonableness | — | — | —
Global Acceptability | — | 0.42 | -0.82
Global Relevance | 0.02 | -0.43 | 0.42
Global Sufficiency | 0.00 | 0.00 | 0.00
Overall | -0.05 | 0.00 | 0.05
Pairs with revision type | 120 | 100 | 95

Table 3: Pearson's r correlation in our annotation study between increases in the 15 quality dimensions of Wachsmuth et al. (2017a) and the main revision types: Claim Clarification, Typo/Grammar Correction, Corrected/Added Links.

Table 3 shows Pearson's r correlation for each quality dimension for the three main revision types. We observe a strong correlation between the revision type Corrected/Added Links and the logical quality dimensions Cogency (0.65) and Local Sufficiency (0.62), which matches the main purpose of such revisions: to add supporting information to a claim. The high negative correlation of this revision type with Global Acceptability (-0.82) indicates that improvements regarding the dimension in question are more prominent in other types. Complementarily, Claim Clarification mainly improves the other logical dimensions (Local Acceptability, Local Relevance). Typo/Grammar corrections, finally, rather seem to support an acceptable linguistic shape, improving Clarity (0.35) and Global Acceptability (0.42).

Finding only low correlations for many rhetorical dimensions (credibility, emotional appeal, etc.) as well as for overall quality, we conclude that the revisions on Kialo seem to target primarily the general form a well-phrased claim should have.
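For illustration only, correlations of this kind can be computed as Pearson's r between a binary revision-type indicator and a binary improvement label per dimension; the data below is invented, and SciPy is assumed.

```python
# Invented example data; this illustrates the correlation computation, not the study results.
from scipy.stats import pearsonr

is_links_revision = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 if the pair is a Corrected/Added Links revision
improved_cogency  = [1, 1, 0, 0, 1, 1, 1, 0]   # 1 if the revision improved Cogency
r, p_value = pearsonr(is_links_revision, improved_cogency)
print(f"r = {r:.2f} (p = {p_value:.3f})")
```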
To study the two proposed tasks, claim quality classification and claim quality ranking, on the given corpus, we consider the following approaches.
We cast this as a pairwise classification task, where the objective is to compare two versions of the same claim and determine which one is better. To solve this task, we compare four methods:
Length
To check whether there is a bias towards longer claims in the data, we use a trivial method which assumes that claims with more characters are better.
S-BOW
As a “traditional” method, we employ the siamese bag-of-words embedding (S-BOW) as described by Potash et al. (2017). We concatenate two bag-of-words matrices, each representing a claim version from a pair, and input the concatenated matrix to a logistic regression. We also test whether information on length improves S-BOW.
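A minimal sketch of this baseline follows; it assumes scikit-learn and simple count vectors, and the exact feature extraction of Potash et al. (2017) may differ.

```python
# Rough sketch of the S-BOW baseline: concatenate the bag-of-words vectors of both
# claim versions (optionally with length features) and train a logistic regression.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pairs = [("Elections insure the independence of the judiciary.",
          "Elections ensure the independence of the judiciary."),
         ("Dogs can help disabled people to navigate the world better.",
          "Dogs can help disabled people function better.")]
labels = [1, 0]  # 1: the second claim in the pair is the better version

vectorizer = CountVectorizer().fit([claim for pair in pairs for claim in pair])
first = vectorizer.transform([a for a, _ in pairs]).toarray()
second = vectorizer.transform([b for _, b in pairs]).toarray()
lengths = np.array([[len(a), len(b)] for a, b in pairs])  # optional length features

X = np.hstack([first, second, lengths])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```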
BERT

We select the BERT model, as it has become the standard neural baseline. BERT is a pre-trained deep bidirectional transformer language model (Devlin et al., 2019). For our experiments, we use the pre-trained version bert-base-cased, as implemented in the huggingface library (https://huggingface.co/transformers/pretrained_models.html). We fine-tune the model for two epochs using the Adam optimizer with learning rate 1e-5; we chose the number of epochs empirically, picking the best learning rate out of {5e-7, 5e-6, 1e-5, 2e-5, 3e-5}.
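As an illustration (assuming the Hugging Face transformers library; the data pipeline here is a toy example, not our training script), the pairwise setup can be expressed as sentence-pair classification:

```python
# Toy sketch of fine-tuning bert-base-cased for pairwise claim comparison.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

batch = tokenizer(["Elections insure the independence of the judiciary."],
                  ["Elections ensure the independence of the judiciary."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])  # 1: the second claim is the better version

model.train()
for _ in range(2):  # two epochs in our setup (here on a single toy batch)
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```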
SBERT

We also use Sentence-BERT (SBERT) to learn to represent each claim version as a sentence embedding (Reimers and Gurevych, 2019), as opposed to the token-level embeddings of standard BERT models. We fine-tune SBERT based on bert-base-cased using a siamese network structure, as implemented in the sentence-transformers library. We set the number of epochs to one, as recommended by the authors (Reimers and Gurevych, 2019), and we use a batch size of 16, the Adam optimizer with learning rate 1e-5, and a linear learning rate warm-up over 10% of the training data. Our default pooling strategy is MEAN.
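A sketch of such a setup with sentence-transformers is shown below; the choice of a softmax classification loss over the two claim embeddings is an assumption made for illustration, not necessarily the exact training objective used.

```python
# Hedged sketch of siamese SBERT fine-tuning; hyperparameters follow the values above.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

word_emb = models.Transformer("bert-base-cased")
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_emb, pooling])

train_examples = [InputExample(texts=["claim before revision", "claim after revision"], label=1)]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.SoftmaxLoss(model,
                          sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                          num_labels=2)

model.fit(train_objectives=[(train_loader, loss)], epochs=1,
          warmup_steps=int(0.1 * len(train_loader)),
          optimizer_params={"lr": 1e-5})
```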
In contrast to the previous task, we cast this problem as a sequence-pair regression task. After obtaining all pairwise scores using S-BOW, BERT, and SBERT, respectively, we map the pairwise labels to real-valued scores and rank them using the following models, once for each method.
BTL
For mapping, we use the well-established Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952; Luce, 2012), in which items are ranked according to the probability that a given item beats an item chosen randomly. We feed the BTL model a pairwise-comparison matrix for all revisions related to a claim, generated as follows: each row represents the probability of the revision being better than the other revisions. All diagonal values are set to zero. Table 4 illustrates an example for a set of three argument revisions.

Table 4: Example of a pairwise score matrix for ranking of three claim revisions, v1–v3, given the pairwise scores for (v1, v2), (v1, v3), and (v2, v3).
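As a simplified stand-in for this step (not the full BTL maximum-likelihood fit), one can score each revision by its average log-odds of beating the others in such a matrix; the probabilities below are invented.

```python
# Simplified ranking from a pairwise probability matrix; an illustrative stand-in,
# not the exact BTL estimation procedure.
import numpy as np

def rank_revisions(P):
    """P[i, j]: predicted probability that revision i is better than revision j
    (diagonal entries are zero). Returns revision indices from best to worst."""
    eps = 1e-6
    n = P.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    log_odds = np.log((P + eps) / (1 - P + eps))
    scores = (log_odds * off_diagonal).sum(axis=1) / (n - 1)  # average log-odds per revision
    return np.argsort(-scores)

P = np.array([[0.0, 0.4, 0.2],
              [0.6, 0.0, 0.3],
              [0.8, 0.7, 0.0]])  # invented probabilities for three revisions v1-v3
print(rank_revisions(P))  # [2 1 0]: the third (latest) revision is ranked best
```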
SVMRank

Additionally, we employ SVMRank (Joachims, 2006), which views the ranking problem as a pairwise classification task. First, we change the input data, provided as a ranked list, into a set of ordered pairs, where the (binary) class label for every pair is the order in which the elements of the pair should be ranked. Then, SVMRank learns by minimizing the error of the order relation when comparing all possible combinations of candidate pairs. Given the nature of the algorithm, we cannot work with token embeddings obtained from BERT directly. Thus, we utilize one of the most commonly used approaches to transform token embeddings into a sentence embedding: extracting the special [CLS] token vector (Reimers and Gurevych, 2019; May et al., 2019). In our experiments, we select a linear kernel for the SVM and use PySVMRank (https://github.com/ds4dm/PySVMRank), a Python API to the SVMrank library written in C.

We now present empirical experiments with the approaches from Section 4. The goal is to evaluate how hard it is to compare and rank the claim revisions in our corpus from Section 3 by quality.
We carry out experiments in two settings. The first considers creating random splits over revision histories, ensuring that all versions of the same claim are in a single split in order to avoid data leakage. We assign 80% of the revision histories to the training set and the remaining 20% to the test set. A drawback of this setup is that it is not clear how well models generalize to unseen debate categories. In the second setting, we therefore also evaluate the methods in a cross-category setup using a leave-one-category-out paradigm, which ensures that all claims from the same debate category are confined to a single split. We split the data in this way to evaluate whether our models learn independent features that are applicable across the diverse set of categories. To assess the effect of adding augmented data, we evaluate all models on both ClaimRev_BASE and ClaimRev_EXT.

For quality classification, we report accuracy and the Matthews correlation coefficient (Matthews, 1975). We report the mean results over five runs in the random setting and the mean results across all test categories in the cross-category setting. To ensure balanced class labels, we create one false claim pair for each true claim pair by shuffling the order of the claims: (v1, v2, true) → (v2, v1, false), where the label denotes whether the second claim in the pair is of higher quality. We report results obtained by models trained on ClaimRev_BASE and ClaimRev_EXT as score pairs in Table 5.

Model | BASE test/Random-Split: Accuracy | MCC | BASE test/Cross-Category: Accuracy | MCC | EXT test/Random-Split: Accuracy | MCC | EXT test/Cross-Category: Accuracy | MCC
Length | 61.3 / 61.3 | 0.23 / 0.23 | 60.7 / 60.7 | 0.21 / 0.21 | 60.8 / 60.8 | 0.22 / 0.22 | 60.0 / 60.0 | 0.20 / 0.20
SBOW | 62.0 / 62.6 | 0.24 / 0.25 | 61.4 / 61.4 | 0.23 / 0.23 | 64.9 / 65.4 | 0.30 / 0.31 | 63.9 / 64.1 | 0.28 / 0.28
SBOW + Length | 65.1 / 65.5 | 0.30 / 0.31 | 64.8 / 64.4 | 0.29 / 0.29 | 67.1 / 67.5 | 0.34 / 0.35 | 66.1 / 66.2 | 0.32 / 0.32
BERT | 75.5 / 75.2 | 0.51 / 0.51 | 75.1 / 74.1 | — / 0.49 | 76.4 / 76.5 | 0.53 / 0.53 | 76.2 / 75.4 | 0.53 / 0.51
SBERT | — | — | — | — | — | — | — | —
Random baseline | 50.0 / 50.0 | 0.00 / 0.00 | 50.0 / 50.0 | 0.00 / 0.00 | 50.0 / 50.0 | 0.00 / 0.00 | 50.0 / 50.0 | 0.00 / 0.00
Single claim baseline | 57.7 / 58.1 | 0.17 / 0.17 | 57.7 / 57.3 | 0.17 / 0.16 | 58.8 / 59.8 | 0.20 / 0.20 | 58.9 / 58.9 | 0.20 / 0.20

Table 5: Claim quality classification results: accuracy and Matthews correlation coefficient (MCC) for all tested approaches in the random-split and the cross-category setting on the two corpus versions. The first value in each value pair is obtained by a model trained on ClaimRev_BASE, the second by a model trained on ClaimRev_EXT. All improvements from one row to the next are significant (t-test).

To measure ranking performance, we calculate Pearson's r and Spearman's ρ correlation, as well as NDCG and MRR. We also compute the Top-1 accuracy, i.e., the proportion of claim sets where the latest version has been ranked best. We average the results on each claim set across the test set for each metric. Afterwards, we average the results across five runs or across all categories, depending on the chosen setting.
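For reference, a sketch of these per-claim-set metrics is given below (assuming SciPy and scikit-learn, with invented scores); treating MRR as the reciprocal rank of the latest version is an assumption of this sketch.

```python
# Invented example of the ranking metrics for a single claim set of four revisions.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import ndcg_score

true_quality = np.array([1, 2, 3, 4])        # later revisions are assumed to be better
pred_scores  = np.array([0.2, 0.5, 0.4, 0.9])

r, _ = pearsonr(true_quality, pred_scores)
rho, _ = spearmanr(true_quality, pred_scores)
ndcg = ndcg_score([true_quality], [pred_scores])
top1 = float(np.argmax(pred_scores) == np.argmax(true_quality))  # latest version ranked best?
rank_of_latest = 1 + int(np.where(np.argsort(-pred_scores) == np.argmax(true_quality))[0][0])
mrr = 1.0 / rank_of_latest
print(r, rho, ndcg, top1, mrr)
```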
Model | Random-Split: r | ρ | Top-1 | NDCG | MRR | Cross-Category: r | ρ | Top-1 | NDCG | MRR
BTL + SBOW+L | 0.38 | 0.37 | 0.62 | 0.94 | 0.79 | 0.36 | 0.35 | 0.60 | 0.94 | 0.78
BTL + BERT | 0.60 | 0.59 | 0.74 | 0.96 | 0.86 | 0.58 | 0.57 | 0.72 | 0.96 | 0.85
BTL + SBERT | 0.63 | 0.62 | 0.77 | — | — | — | — | — | — | —
Random baseline | 0.00 | 0.00 | 0.42 | 0.91 | 0.68 | 0.00 | 0.00 | 0.42 | 0.91 | 0.67

Table 6: Claim quality ranking results: Pearson's r and Spearman's ρ correlation as well as top-1 accuracy, NDCG, and MRR for all tested approaches in the random-split and the cross-category setting on ClaimRev_EXT. In all cases, SVMRank + SBERT is significantly better than all others (t-test).

The results in Table 5 show that a claim's length is a weak indicator of quality (up to 61.3 accuracy). An intuitive explanation is that, even though claims with more information may be better, it is also important to keep them readable and concise.
Despite SBOW's good performance on predicting convincingness (Potash et al., 2017), the claim quality in our corpus cannot be captured by a model of such simplicity (maximum accuracy of 65.4). We point out that adding other linguistic features (for example, part-of-speech tags or sentiment scores) may further improve SBOW. As an example, we equip SBOW with length features and observe a significant improvement (up to 67.5).

As for the transformer-based methods, we see that BERT and SBERT consistently outperform SBOW in all settings on both corpus versions, with SBERT's accuracy of up to 77.7 being best.

A comparison of the performance of the methods depending on the corpus used for training in Table 5 shows the effect of augmenting the original Kialo data. In most cases, the results obtained by models trained on ClaimRev_EXT are comparable to (slightly higher or lower than) the results obtained by models trained on ClaimRev_BASE. This means that adding relations between non-consecutive claim versions does not improve the reliability of the methods. Given that the performance scores obtained on the ClaimRev_EXT test set are evidently higher than on the ClaimRev_BASE test set, we can conclude that the augmented cases are easier to classify and that the cumulative difference in quality is more evident.

Additionally, we have experimented with ELECTRA (Clark et al., 2020), which pre-trains a text encoder as a discriminator rather than a generator, and obtained results slightly better than BERT, yet inferior to SBERT. We omit these results here, since they did not provide any further notable insights.
We can also see in Table 5 that the trained models are able to generalize across categories; the accuracy and MCC scores in the random-split and cross-category settings for each method are very similar, with only a slight drop in the cross-category setting. This indicates that the nature of the revisions is relatively consistent among all categories, yet reveals the existence of some category-dependent features.

To find out whether BERT really captures the relative revision quality and not only lexical features present in the original claim, we introduced a Single claim baseline, analogous to the hypothesis-only baseline in natural language inference (Poliak et al., 2018). It can be seen that its accuracy and MCC scores are low across all settings (maximum accuracy of 59.8), which indicates that BERT indeed mostly captures the relative quality of revisions.
Table 6 lists the results of our ranking experiments, which show patterns similar to those achieved in the classification task. We can observe similar patterns in both of the selected ranking approaches: SBERT consistently outperforms all other considered approaches across all settings (up to 0.73 and 0.72 in Pearson's r and Spearman's ρ, respectively). BERT and SBERT outperform SBOW, indicating that transformer-based methods are more capable of capturing the relative quality of revisions. While BTL + BERT obtains results comparable to BTL + SBERT, we find that using the CLS-vector as a sentence embedding representation leads to lower results. We point out, though, that using other sentence embeddings and/or pooling strategies (for example, averaged BERT embeddings) may further improve results. Similar to the results of the classification task, we observe only a slight performance drop in the cross-category setting when using BTL for ranking, yet an increase when using SVMRank, again emphasizing the topic-independent nature of claim quality in our corpus.

Task | Label | Accuracy | Instances
Type | Claim Clarification | 69.7 | 12,856
Type | Typo/Grammar Correction | 83.6 | 12,125
Type | Corrected/Added Links | 89.3 | 3,660
Type | Changed Meaning of Claim | 57.3 | 232
Type | Misc | 67.2 | 2,130
Type | None | 78.3 | 45,842
Distance | Revision distance 1 | 76.2 | 42,341
Distance | Revision distance 2 | 79.6 | 17,478
Distance | Revision distance 3 | 80.6 | 8,023
Distance | Revision distance 4 | 81.0 | 3,979
Distance | Revision distance 5 | 79.5 | 2,103
Distance | Revision distance 6+ | 74.9 | 2,921
All | | 77.7 | 76,845

Table 7: Accuracy of the best model, SBERT, on each single revision type and distance in ClaimRev_EXT, along with the number of instances per case.
To further explore the capabilities and limitations of the best model, SBERT, we analyzed its performance on each revision type and distance.

As the upper part of Table 7 shows, SBERT is highly capable of assessing revisions related to the correction and addition of links and supporting information. This revision type also obtained the highest correlations between quality dimensions and type of revision (see Table 3), which indicates that the patterns of changes performed within this type are more consistent. In contrast, we observe that the model fails to address revisions related to the changed meaning of a claim. On the one hand, this may be due to the fact that such examples are underrepresented in the data. On the other hand, the consideration of such examples in the selected tasks is questionable, since changing the meaning of a claim is usually considered the creation of a new claim and not a new version of a claim.

An insight from the lower part of Table 7 is that the accuracy of predictions increases from revision distance 1 to 4. We obtain better results when comparing non-consecutive claims than when comparing claim pairs with a distance of 1. An intuitive explanation is that, since each single revision should ideally improve the quality of a claim, the more revisions a claim undergoes, the more evident the quality improvement should be. For larger distances, the accuracy starts to decrease again, but this may be due to the limited number of cases given.

In this paper, we have proposed a new way of assessing quality in argumentation by considering different revisions of the same claim. This allows us to focus on characteristics of quality regardless of the discussed topics, aspects, and stances in argumentation. We provide a new corpus of web claims, which is the first large-scale corpus to target quality assessment and revision processes on the claim level. We have carried out initial experiments on this corpus using traditional and transformer-based models, yielding promising results but also pointing to limitations. In a detailed analysis, we have studied different kinds of claim revisions and provided insights into the aspects of a claim that influence the users' perception of quality. Such insights could help improve writing support in educational settings, or identify the best claims for debating technologies and argument search.

We seek to encourage further research on how to help online debate platforms automate the process of quality control and design automatic quality assessment systems. Such systems can be used to indicate whether suggested revisions increase the quality of an argument or to recommend the type of revision needed. We leave it for future work to investigate whether the learned concepts of quality are transferable to content from other collaborative online platforms (such as idebate.org or Wikipedia), or to data from other domains, such as student essays and forum discussions.
Acknowledgments
We thank Andreas Breiter for feedback on early drafts, and the anonymous reviewers for their helpful comments. This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under project number 374666841, SFB 1342.
References
Tazin Afrin and Diane Litman. 2018. Annotation and classification of sentence-level revision improvement. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 240–246, New Orleans, Louisiana. Association for Computational Linguistics.

Yamen Ajjour, Henning Wachsmuth, Johannes Kiesel, Martin Potthast, Matthias Hagen, and Benno Stein. 2019. Data acquisition for argument search: The args.me corpus. In KI 2019: Advances in Artificial Intelligence - 42nd German Conference on AI, Kassel, Germany, September 23-26, 2019, Proceedings, pages 48–59.

Khalid Al-Khatib, Henning Wachsmuth, Matthias Hagen, Jonas Köhler, and Benno Stein. 2016. Cross-domain mining of argumentative text through distant supervision. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1395–1404. Association for Computational Linguistics.

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Esin Durmus, Faisal Ladhak, and Claire Cardie. 2019. The role of pragmatic and discourse context in determining argument impact. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5668–5678, Hong Kong, China. Association for Computational Linguistics.

Roxanne El Baff, Henning Wachsmuth, Khalid Al Khatib, and Benno Stein. 2020. Analyzing the persuasive effect of style in news editorial argumentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3154–3160, Online. Association for Computational Linguistics.

Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, Ranit Aharonov, and Noam Slonim. 2019. Are you convinced? Choosing the more convincing evidence with a Siamese network. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 967–976, Florence, Italy. Association for Computational Linguistics.

Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, and Noam Slonim. 2019. A large-scale dataset for argument quality ranking: Construction and analysis. arXiv preprint arXiv:1911.11408.

Ivan Habernal and Iryna Gurevych. 2015. Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2127–2137. Association for Computational Linguistics.

Ivan Habernal and Iryna Gurevych. 2016. Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1589–1599, Berlin, Germany. Association for Computational Linguistics.

Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 217–226, New York, NY, USA. Association for Computing Machinery.

Christian Kock. 2007. Dialectical obligations in political debate. Informal Logic, 27(3):233–247.

Anne Lauscher, Lily Ng, Courtney Napoles, and Joel Tetreault. 2020. Rhetoric, logic, and dialectic: Advancing theory-based argument quality assessment in natural language processing. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4563–4574, Barcelona, Spain (Online). International Committee on Computational Linguistics.

R. Duncan Luce. 2012. Individual Choice Behavior: A Theoretical Analysis. Courier Corporation.

Stephanie Lukin, Pranav Anand, Marilyn Walker, and Steve Whittaker. 2017. Argument strength is in the eye of the beholder: Audience effects in persuasion. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 742–753. Association for Computational Linguistics.

B. W. Matthews. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628, Minneapolis, Minnesota. Association for Computational Linguistics.

Isaac Persing and Vincent Ng. 2013. Modeling thesis clarity in student essays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 260–269, Sofia, Bulgaria. Association for Computational Linguistics.

Isaac Persing and Vincent Ng. 2015. Modeling argument strength in student essays. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 543–552, Beijing, China. Association for Computational Linguistics.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.

Peter Potash, Robin Bhattacharya, and Anna Rumshisky. 2017. Length, interchangeability, and external knowledge: Observations from predicting argument convincingness. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 342–351, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Peter Potash, Adam Ferguson, and Timothy J. Hazen. 2019. Ranking passages for argument convincingness. In Proceedings of the 6th Workshop on Argument Mining, pages 146–155, Florence, Italy. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence - an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 440–450, Lisbon, Portugal. Association for Computational Linguistics.

Edwin Simpson and Iryna Gurevych. 2018. Finding convincing arguments using scalable Bayesian preference learning. Transactions of the Association for Computational Linguistics, 6:357–371.

Christian Stab and Iryna Gurevych. 2017. Recognizing insufficiently supported arguments in argumentative essays. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 980–990, Valencia, Spain. Association for Computational Linguistics.

Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 613–624, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.

Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. 2019a. Automatic argument quality assessment - new datasets and methods. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5625–5635. Association for Computational Linguistics.

Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. 2019b. Automatic argument quality assessment - new datasets and methods. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5625–5635, Hong Kong, China. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Alberdingk Thijm, Graeme Hirst, and Benno Stein. 2017a. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 176–187, Valencia, Spain. Association for Computational Linguistics.

Henning Wachsmuth, Benno Stein, and Yamen Ajjour. 2017b. "PageRank" for argument relevance. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1117–1127, Valencia, Spain. Association for Computational Linguistics.

Henning Wachsmuth and Till Werner. 2020. Intrinsic quality assessment of arguments. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6739–6745, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Zhongyu Wei, Yang Liu, and Yi Li. 2016. Is this post persuasive? Ranking argumentative comments in online forum. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 195–200, Berlin, Germany. Association for Computational Linguistics.

Fan Zhang, Homa B. Hashemi, Rebecca Hwa, and Diane Litman. 2017. A corpus of annotated revisions for studying argumentative writing. In