Belief-based Generation of Argumentative Claims
Milad Alshomary, Wei-Fan Chen, Timon Gurcke, Henning Wachsmuth
Department of Computer Science, Paderborn University, Paderborn, Germany
Abstract
When engaging in an argumentative discourse, skilled human debaters tailor claims to the beliefs of the audience in order to construct effective arguments. Recently, the field of computational argumentation has witnessed extensive effort to address the automatic generation of arguments. However, existing approaches do not perform any audience-specific adaptation. In this work, we aim to bridge this gap by studying the task of belief-based claim generation: Given a controversial topic and a set of beliefs, generate an argumentative claim tailored to the beliefs. To tackle this task, we model people's prior beliefs through their stances on controversial topics and extend state-of-the-art text generation models to generate claims conditioned on the beliefs. Our automatic evaluation confirms the ability of our approach to adapt claims to a set of given beliefs. In a manual study, we additionally evaluate the generated claims in terms of informativeness and their likelihood to be uttered by someone with a respective belief. Our results reveal the limitations of modeling users' beliefs based on their stances, but demonstrate the potential of encoding beliefs into argumentative texts, laying the ground for future exploration of audience reach.
Introduction

According to van Eemeren and Houtlosser (1999), debaters engaging in an argumentative discourse aimed at resolving disagreement design their next argumentative move considering the topical potential, the audience demand, and appropriate presentational devices. Feinberg and Willer (2015) stress, based on moral foundation theory (Godden, 2010), how phrasing arguments to fit the audience's morals leads to better agreement. For example, in a debate on former US president
Donald Trump, potential topics could have been immigration, health care plans, tax plans, etc. However, knowledge about the audience being middle-class workers would have suggested restricting the selection to Trump's tax plans. An appropriate usage of presentational devices may have then put a con argument as follows:
Example: "Donald Trump was a bad president. He did nothing but hurt the poor and middle class; his tax plan benefited only rich people who could afford it."
Recently, interest in argument generation as a subfield of computational argumentation has grown. Several tasks have been proposed, including claim negation (Bilu et al., 2015; Hidey and McKeown, 2019), counterargument generation (Hua et al., 2019), and conclusion generation (Alshomary et al., 2020). While some research considers argumentative strategies when delivering arguments (Wachsmuth et al., 2018; El Baff et al., 2019), no one has worked on adapting arguments to user beliefs yet. Our goal is to bridge this gap.

In this work, we propose to extend argument generation technologies with the ability to encode beliefs. This does not only better reflect the process by which humans reason, but it also allows controlling the output in order to better reach the audience. In particular, we introduce the task of belief-based claim generation:
Given a controversial topic and a representation of a user's beliefs, generate a claim that is both relevant to the topic and matches the beliefs.

To approach this task, we first model user beliefs by their stances (pro or con) on a set of controversial topics, and then extend two state-of-the-art text generation approaches by conditioning their output on a specific set of beliefs. One approach builds on Li et al. (2016), equipping a sequence-to-sequence (Seq2Seq) model with a context vector representing the given stances. The other approach controls the output of a pre-trained argumentative language model (LM) using the algorithm of Dathathri et al. (2020) to ensure that the output resembles the user's beliefs.

We study the given task empirically on the debate.org dataset of Durmus and Cardie (2018). The dataset contains users' arguments on various controversial topics as well as their stances towards the most popular topics on the website, called the big issues. For our purposes, we use these big issues as the controversial topics, and we model beliefs by the users' stances towards them.

In our automatic evaluation, we compare both models against their unconditioned correspondents (i.e., the same models without knowledge about a user). We assess the generated claims in terms of their similarity to the ground truth and the likelihood of carrying textual features that reflect users' stances on big issues. Our results suggest that using users' beliefs significantly increases the effectiveness of the Seq2Seq model and the LM in most cases. Moreover, a stance classifier trained on claims generated by the conditioned LM achieves the best average accuracy across all big issues.

In a subsequent manual evaluation, we find that claims generated by the conditioned LM are more informative regarding the topic. In terms of predicting stance from generated claims, we analyze the limitations of our approach in detail, which lie in the belief encoding step. When these limitations are avoided, we find that the generated claims enable the annotators to correctly predict a stance on a given big issue in 45% of the cases (26% incorrectly). These results demonstrate the applicability of encoding a user's beliefs into argumentative texts, enabling future research on the effect of belief-based argumentative claims on audiences.

The contribution of this work is threefold:
• A new task, belief-based claim generation.
• An approach to model and match users' beliefs in the generation of arguments.
• Empirical evidence of the applicability of encoding beliefs into argumentative texts.

Related Work

Early research on argument generation aimed to create argumentative texts starting from a symbolic representation (Zukerman et al., 2000; Grasso et al., 2000; Carenini and Moore, 2006). Conceptually, those approaches all had a similar architecture consisting of three main phases: text planning, sentence planning, and realization (Stede et al., 2018). While they included a user model to a certain extent and aimed to generate convincing arguments, they were still performed on a limited scale.

With the tremendous advances of NLP and machine learning since then, research has begun to address different tasks in the realm of argument generation, showing promising results. Hua et al. (2019) proposed a neural network-based framework for generating counterarguments. Both Bilu et al. (2015) and Hidey and McKeown (2019) addressed the task of claim negation, using a rule-based and a neural approach, respectively.
Also, Sato et al. (2015) proposed an approach to argument generation based on sentence retrieval, in which, given a topic, a set of paragraphs covering different aspects is generated. However, these approaches are agnostic to the target audience.

Chen et al. (2018) modified the political bias of (often claim-like) news headlines using style transfer, accounting at least for general political sides (left and right). Moreover, Wachsmuth et al. (2018) modeled rhetorical strategies in argument synthesis conceptually, but the computational realization (El Baff et al., 2019) considers the audience implicitly only, using a language model approach to select and arrange the argumentative discourse units that are phrased in an argument.

In the field of conversational AI, researchers have utilized machine translation techniques to tackle the task of dialog generation (Ritter et al., 2011). Li et al. (2016) worked on augmenting sequence-to-sequence models by learning persona vectors from the given data. In a similar fashion, one of our approaches extends such a model by a context vector representing a user's beliefs. Here, however, we deal with argumentative text.

Progress in the field of text generation has been made due to the availability of large pre-trained language models (Devlin et al., 2018; Solaiman et al., 2019). While these models excel at generating coherent texts, ensuring that a generated text possesses a certain property is not straightforward. Some research has tackled this limitation, offering ways to better control the output (Keskar et al., 2019; Ziegler et al., 2019). One of the most flexible of these approaches is that of Dathathri et al. (2020), which does not require fine-tuning for each controlling theme. Their algorithm conditions the output of a language model to contain certain properties defined by a discriminative classifier or a bag-of-words. One of our approaches makes use of this algorithm to condition the output of an argumentative language model on a bag-of-words that represents a user's beliefs. A recent relevant work by Schiller et al. (2020) deals with the generation of aspect-controlled arguments. Similar to us, the authors utilize a pre-trained language model to generate arguments on a specific topic, with a controlled stance and aspect. Their focus is on topical aspects of arguments, though, and their approach, based on Keskar et al. (2019), is limited to a predefined set of topics and aspects.

Belief-based Claim Generation

Due to the importance of the audience in argumentation when aiming for persuasiveness (van Eemeren and Houtlosser, 1999), and due to the fact that humans comply with certain morals that shape their beliefs and affect their reasoning (Godden, 2010; Feinberg and Willer, 2015), we introduce the audience's beliefs as a new dimension of the argument generation process in this work. For this, we propose a new task, belief-based claim generation:
Given a controversial topic and a repre-sentation of the audience’s beliefs, gen-erate a claim that is both relevant to thetopic and matches the beliefs.
We focus this task on generating claims rather than full arguments, to keep it simple and because claims denote the main units from which arguments are built. As shown by Feinberg and Willer (2015), better agreement is achieved when arguments are framed with respect to the audience's beliefs. Therefore, we argue that studying this task will enable argumentation technology, knowing its audience, to generate more convincing arguments, bridging the gap between disagreeing parties.
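Since the task is new, the following typed sketch pins down its input/output interface; the names are ours, not the paper's. Beliefs enter as a mapping from big issues to stances, and the output is a single claim string.

```python
# Illustrative interface for belief-based claim generation (names are
# ours, not the paper's): beliefs are a user's stances on big issues.
from dataclasses import dataclass
from typing import Dict, Literal

Stance = Literal["pro", "con"]

@dataclass
class BeliefProfile:
    stances: Dict[str, Stance]  # big issue -> stance, e.g. {"abortion": "con"}

def generate_claim(topic: str, beliefs: BeliefProfile) -> str:
    """Topic + beliefs -> a claim relevant to the topic that matches the beliefs."""
    raise NotImplementedError  # realized by the two approaches described below
```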
Data

To study the proposed task, a dataset is needed that provides information about users revealing their beliefs as well as their arguments on various topics. Here, we build upon the dataset introduced by Durmus and Cardie (2018), which was collected from debate.org, an online platform where users can engage in debates on controversial topics and share their profiles. The dataset contains users' arguments as answers to topic questions and engagements in debates, along with various user information, including a user's self-specified stances (pro or con) on up to 48 predefined popular controversial topics, called the big issues.
Dataset          Claims    Topics    Users
Training set     41,288    22,241    5,189
Validation set    5,028     2,450    2,509
Test set          5,154     2,728    2,512
Full dataset     51,470    27,419    5,189
Table 1: Number of claims, topics, and users in each of the training, validation, and test sets of the data used in this paper.

For the task at hand, we keep in our dataset only users who have at least three arguments and stated their stance on at least one of the big issues. For those, we collected their arguments along with the topics and stances. In total, the dataset contains around 51k claims on 27k topics from 5k users. We randomly split the dataset per topic into 10% test and 90% training data; 10% of the latter is used as the validation set. Statistics are given in Table 1.

To develop approaches to the belief-based claim generation task, we need training data in which claims can be identified as such. Since claim detection is not our focus, we preprocess all data using the claim detection approach of Chakrabarty et al. (2019). In particular, we score the likelihood of each sentence being a claim, and we keep only the sentence with the highest score as the user's claim on the topic. To evaluate this step, we created a sample of 100 arguments, and two annotators decided whether the extracted sentence represents a claim on the given topic or not. In terms of full agreement, the model extracted claims correctly in 81% of the cases, with a Cohen's κ inter-annotator agreement of 0.3. We note that this preprocessing step produces some noise in the data, mainly affecting the training of our Seq2Seq model below.
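As a minimal sketch of this preprocessing, the following assumes a hypothetical score_claim function that stands in for the classifier of Chakrabarty et al. (2019):

```python
# Minimal sketch of the claim-extraction preprocessing; score_claim is a
# hypothetical stand-in for the claim detector of Chakrabarty et al. (2019).
from typing import Callable, List

def extract_claim(argument_sentences: List[str],
                  score_claim: Callable[[str], float]) -> str:
    """Return the sentence most likely to be a claim on the topic."""
    # Score every sentence and keep only the one with the highest score.
    return max(argument_sentences, key=score_claim)
```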
Approach

To study our research question, we propose and compare two approaches that build on top of known techniques for text generation. Both approaches rely on modeling users' beliefs via their stances on big issues. The first is an extension of the Seq2Seq model (Sutskever et al., 2014), in which the user's stances are encoded as a context vector; the second conditions the output of a pre-trained argumentative language model via a bag-of-words constructed based on stances on big issues.

Seq2Seq-based Model

Given a topic as a sequence of words $T = (w_1, w_2, \ldots, w_n)$, a user vector $\vec{U} \in \{0, 1\}^k$ with $k$ being the number of big issues, and a claim as a sequence of words $C = (w_1, w_2, \ldots, w_m)$, an LSTM-based encoder first consumes the input topic and produces a hidden state $\vec{h}$, which is used to initialize the LSTM-based decoder. The user vector $\vec{U}$ is projected into a new embedding space via a feed-forward network with a learned weight matrix $W_U$, producing a new vector $\vec{V}$:

$$\vec{V} = \sigma(W_U \cdot \vec{U})$$

Following Li et al. (2016), $\vec{V}$ serves as the speaker embedding in the model. The difference to the speaker model of Li et al. (2016) is that their speaker vector is not explicitly predefined but learned from the data, whereas in our model the user representation $\vec{U}$ is predefined as a binary vector of the user's stances on big issues, and only its projection $\vec{V}$ is learned. By augmenting the Seq2Seq model with this user context vector, the model is supposed to capture the correlation between users' stances on big issues and the corresponding claims. Once the correlation is learned, the model can generate a claim utilizing not only the topic but also the stances of the target user on the big issues, which reflect the beliefs.
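A minimal PyTorch sketch of this extension follows. The layer sizes match the implementation details given later (two-layer LSTMs of hidden size 512, 300-dimensional embeddings, a 16-dimensional user space), but the exact wiring of $\vec{V}$ into the decoder is our illustrative choice, not the paper's exact code.

```python
# Minimal PyTorch sketch of the belief-conditioned Seq2Seq idea: the
# binary stance vector U is projected to an embedding V, which is fed
# to the decoder at each step like the speaker embedding of Li et al. (2016).
import torch
import torch.nn as nn

class BeliefSeq2Seq(nn.Module):
    def __init__(self, vocab_size: int, num_big_issues: int = 48,
                 emb_dim: int = 300, hidden_dim: int = 512, user_dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        # Decoder input: word embedding concatenated with the user vector V.
        self.decoder = nn.LSTM(emb_dim + user_dim, hidden_dim,
                               num_layers=2, batch_first=True)
        self.user_proj = nn.Linear(num_big_issues, user_dim)  # W_U
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, topic_ids, claim_ids, user_vec):
        # Encode the topic; its final state initializes the decoder.
        _, state = self.encoder(self.embed(topic_ids))
        # V = sigma(W_U * U), the learned belief embedding.
        v = torch.sigmoid(self.user_proj(user_vec))
        v = v.unsqueeze(1).expand(-1, claim_ids.size(1), -1)
        dec_in = torch.cat([self.embed(claim_ids), v], dim=-1)
        output, _ = self.decoder(dec_in, state)
        return self.out(output)  # logits over the vocabulary
```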
Conditioned Language Model

In this approach, we represent a user's stances on big issues as a bag-of-words. We then use the topic as a prompt for a pre-trained argumentative language model (LM) to synthesize a claim, conditioned using the algorithm of Dathathri et al. (2020). The synthesis process is illustrated in Figure 1.

Argumentative Language Model
Since we aim to generate claims in particular, a standard LM is not enough. To model argumentative language, we take an LM pre-trained on general language and fine-tune it on a large set of arguments (in our experiments, we use the corpus of Ajjour et al. (2019)). The result is an LM that is able to generate argumentative text.
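A sketch of this fine-tuning step with the transformers library is shown below; the file path, block size, and training hyperparameters are placeholders (the paper fine-tunes GPT-2 on the roughly 400k arguments of the args.me corpus).

```python
# Sketch of fine-tuning a general LM on argumentative text with the
# transformers library (Wolf et al., 2019), using its older-style
# TextDataset utilities. Paths and hyperparameters are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, TextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One argument per line in a plain-text file (hypothetical path).
dataset = TextDataset(tokenizer=tokenizer, file_path="arguments.txt",
                      block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="arg-lm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()  # the result is an "argumentative" LM
```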
Belief-based Bag-of-words
Next, we build a bag-of-words that represents the beliefs of a user. We learn it from the user's stances on the big issues. For example, a user who is pro abortion would likely be pro choice. Hence, words such as right and choice are candidates to be included in their belief-based bag-of-words. To this end, we first build two bag-of-words representations for each big issue, one for the pro side and one for the con side. For a user, we then construct a belief-based bag-of-words based on their stances on the big issues.
Figure 1: The synthesis process of the conditioned LM on the topic "Whaling", given a user who is pro environmental protection and global warming and con torture. Steps: (1) building $U_{bow}$ based on the stances, (2) a forward pass through the LM to generate a token (sport), (3) updating the LM history $H_t$ based on $p(U_{bow} \mid x)$, and (4) generating a new token (cruel) from the new history $\hat{H}_t$.

To build representative pro and con bags-of-words for each big issue, we follow the topic signature approach of Lin and Hovy (2000). Given a big issue, we first collect three sets from some corpus of arguments: relevant pro arguments $R_{pro}$, relevant con arguments $R_{con}$, and a random set of non-relevant arguments $\hat{R}$. For each relevant set, we then compute a likelihood ratio for all its words with respect to $\hat{R}$ and keep only words with a score higher than a specific threshold $\tau$, resulting in two sets of words, $W_{pro}$ and $W_{con}$. Since a word may appear in both sets, we remove it from the set where it occurs fewer times. Finally, we sort the words according to their likelihood ratio and keep the top $k$ words in both $W_{pro}$ and $W_{con}$, forming the final pro and con bags-of-words, respectively.
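A compact sketch of this construction follows. The log-likelihood-ratio statistic is simplified here to a smoothed frequency-ratio score, so the values $\tau = 10$ and $k = 25$ used in the paper should be read as placeholders rather than directly transferable thresholds.

```python
# Sketch of building per-issue pro/con bags-of-words via topic signatures
# (Lin and Hovy, 2000). The scoring below is a simplified stand-in for
# the likelihood-ratio test; tau and k mirror the paper's settings.
import math
from collections import Counter
from typing import List, Tuple

def signature_words(relevant: List[str], background: List[str],
                    tau: float = 10.0, k: int = 25) -> List[str]:
    rel = Counter(" ".join(relevant).lower().split())
    bg = Counter(" ".join(background).lower().split())
    n_rel, n_bg = sum(rel.values()), sum(bg.values())
    scores = {}
    for w, c in rel.items():
        p_rel = c / n_rel                      # relative frequency in R_pro/R_con
        p_bg = (bg[w] + 1) / (n_bg + len(bg))  # smoothed frequency in R_hat
        scores[w] = c * math.log(p_rel / p_bg)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [w for w in ranked if scores[w] > tau][:k]

def resolve_overlap(w_pro: List[str], w_con: List[str],
                    pro_text: str, con_text: str) -> Tuple[List[str], List[str]]:
    """Drop a word shared by both sets from the side where it occurs less often."""
    pro_c, con_c = Counter(pro_text.split()), Counter(con_text.split())
    shared = set(w_pro) & set(w_con)
    keep_pro = [w for w in w_pro if w not in shared or pro_c[w] >= con_c[w]]
    keep_con = [w for w in w_con if w not in shared or con_c[w] > pro_c[w]]
    return keep_pro, keep_con
```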
Claim Generation

Given a user (represented by stances on big issues) and a topic, we construct a belief-based bag-of-words (step 1 in Figure 1):

$$U_{bow} = W_1 \cup W_2 \cup \ldots \cup W_n$$

where $W_i$ is the pro bag-of-words of the $i$-th big issue if the user's stance on it is pro, and the con bag-of-words otherwise. Then, we use the topic as a prompt and the user's bag-of-words $U_{bow}$ to condition the generated claim (see Figure 1). In particular, given a transformer-based LM (Vaswani et al., 2017), a token $x_{t+1}$ is generated at each time step as follows:

$$o_{t+1}, H_{t+1} = \text{LM}(x_t, H_t)$$
$$x_{t+1} \sim p_{t+1} = \text{Softmax}(W \cdot o_{t+1})$$

where $H_t$ represents the history of the LM. Using the algorithm of Dathathri et al. (2020), called Plug and Play LM (PPLM), an update to the past, $\Delta H_t$, is computed to control the generated claim, based on the sum of the log likelihoods $p(U_{bow} \mid x)$ of all words in the belief-based bag-of-words. The new history, $\hat{H}_t = H_t + \Delta H_t$, is then used as in the equations above to draw a new distribution $\hat{p}_{t+1}$, from which a new token is sampled. To ensure fluency in the generated text, $\Delta H_t$ is further modified to maintain a high log-likelihood $p(x)$ with respect to the LM. More details on the algorithm can be found in the work of Dathathri et al. (2020).

In short, by fine-tuning an LM on argumentative text, we tune it to generate claims. Using the topic as a prompt, we ensure that the claim is on the topic. Finally, the PPLM algorithm encodes the beliefs, modeled as the bag-of-words $U_{bow}$, in the claim.
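The following is a deliberately simplified sketch in the spirit of PPLM: it builds $U_{bow}$ from the user's stances and, at each decoding step, takes gradient steps on the cached history to raise $p(U_{bow} \mid x)$. The hyperparameters and the greedy decoding are our illustrative choices, the fluency terms of the original algorithm are omitted, and the code assumes a transformers version that returns past_key_values as tuples of tensors; it is not the official implementation of Dathathri et al. (2020).

```python
# Simplified, illustrative sketch of bag-of-words conditioning in the
# spirit of PPLM (Dathathri et al., 2020). Not the official algorithm:
# fluency terms are omitted and decoding is greedy.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def build_user_bow(stances, pro_bows, con_bows):
    """U_bow: union of the per-issue word sets matching the user's stances."""
    bow = set()
    for issue, stance in stances.items():
        bow |= set(pro_bows[issue]) if stance == "pro" else set(con_bows[issue])
    return sorted(bow)

def generate_conditioned(prompt, u_bow, steps=30, alpha=0.02, n_updates=3):
    bow_ids = [tok.encode(" " + w)[0] for w in u_bow]  # first subtoken per word
    ids = tok.encode(prompt, return_tensors="pt")
    out = model(ids, use_cache=True)
    past, last = out.past_key_values, ids[:, -1:]
    for _ in range(steps):
        # Treat the cached history H_t as free parameters and ascend the
        # bag-of-words log-likelihood, log p(U_bow | x).
        past = tuple(tuple(p.detach().requires_grad_(True) for p in layer)
                     for layer in past)
        for _ in range(n_updates):
            logits = model(last, past_key_values=past).logits[:, -1, :]
            log_probs = torch.log_softmax(logits, dim=-1)
            loss = -log_probs[0, bow_ids].logsumexp(0)  # -log sum_w p(w)
            loss.backward()
            with torch.no_grad():
                past = tuple(tuple((p - alpha * p.grad).detach().requires_grad_(True)
                                   for p in layer) for layer in past)
        with torch.no_grad():  # generate from the shifted distribution
            out = model(last, past_key_values=past, use_cache=True)
        last = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        past = out.past_key_values
        ids = torch.cat([ids, last], dim=-1)
    return tok.decode(ids[0])
```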
Automatic Evaluation

In this section, we evaluate whether utilizing a user's beliefs as input, modeled as stances on big issues, leads to claims that better match the ground-truth claims and reveal the input stances on the big issues.

On the one hand, we compute the BLEU and METEOR scores of the generated claims with respect to the ground-truth claims. On the other hand, we compute the likelihood that the generated claims possess textual features that reflect the input user's beliefs. We do so by measuring the accuracy of predicting a user's stances on big issues given the generated claims. We compute this accuracy for each of the 48 big issues individually and report the results for all of them. To this end, we carry out the following three steps for a given approach.

First, we generate claims for all given users and topics in the test dataset. Second, we keep only instances in which users have a stance (pro/con) on the tested big issue, and split the filtered dataset into training and test sets. Finally, we train a simple TF-IDF-based linear classifier on the training set to predict the stance on the big issue given the text of the claim. The accuracy of the classifier on the test split then quantifies the likelihood of the generated claims possessing textual features that reflect the stance on the corresponding big issue.
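Sketches of both automatic measures are given below. NLTK for BLEU/METEOR and a TF-IDF logistic-regression pipeline for the stance probe are our assumptions; the paper specifies only a "simple TF-IDF based linear classifier" and does not name its metric toolkit.

```python
# Sketches of the two automatic measures: (a) similarity to the ground
# truth via sentence-level BLEU/METEOR (NLTK assumed), and (b) the stance
# probe, a TF-IDF linear classifier (scikit-learn; the exact linear model
# is our assumption).
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def similarity_scores(reference: str, hypothesis: str):
    ref, hyp = reference.split(), hypothesis.split()
    smooth = SmoothingFunction().method1
    bleu1 = sentence_bleu([ref], hyp, weights=(1, 0, 0, 0),
                          smoothing_function=smooth)
    bleu3 = sentence_bleu([ref], hyp, weights=(1/3, 1/3, 1/3, 0),
                          smoothing_function=smooth)
    return bleu1, bleu3, meteor_score([ref], hyp)

def stance_probe_accuracy(train_claims, train_stances, test_claims, test_stances):
    """Accuracy of predicting the stance on one big issue from claim text."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_claims, train_stances)  # stances: "pro" / "con"
    return accuracy_score(test_stances, clf.predict(test_claims))
```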
Approach        BLEU-1   BLEU-3   METEOR
S2S-baseline    18.2%    0.44%
S2S-model       *        *
LM-baseline     9.6%
LM-conditioned  *        *

Table 2: BLEU and METEOR scores of the claims of each evaluated approach compared to the ground-truth claims. Values marked with * are significantly better than the respective baseline at p < .05 (Student's t-test).

In the following, we give implementation details of our approaches and the corresponding baselines:
Seq2Seq-based Model
Based on the OpenNMT framework (Klein et al., 2017), the encoder and decoder are each two-layer LSTMs of hidden size 512 with GloVe word embeddings of size 300. Users' stances on big issues are represented as a one-hot encoded vector and then projected into a 16-dimensional space through a one-layer dense neural network. We trained the model with the Adagrad optimizer (batch size 16) and refer to it as
S2S-model.

Conditioned Language Model
We constructed the pro/con relevant argument sets ($R_{pro}$, $R_{con}$) by querying the respective big issue from the API provided by Ajjour et al. (2019) and extracting pro/con arguments from the top 60 results. For the non-relevant argument set ($\hat{R}$), we used the same corpus (Ajjour et al., 2019) and randomly selected 100 arguments. We eliminated all words with a score under $\tau = 10$ and finally kept the top $k = 25$ words from each set to represent the bags-of-words. We refrained from tuning these parameters, since we do not have a ground truth for them.

To model argumentative language, we fine-tuned the GPT-2 model on the corpus of Ajjour et al. (2019), which contains around 400k arguments. The fine-tuning was performed using the transformers framework (Wolf et al., 2019). We used the topic as a prompt to trigger the generation process. However, since some topics are phrased as a question (e.g., "is abortion wrong?"), we extracted the noun phrase from the topic and used it as a prompt. For conditioning the generated claim, we used the PPLM implementation (Dathathri et al., 2020) with a step size of 0.15 and a repetition penalty of 1.2. We call this model the LM-conditioned.
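A small sketch of this prompt derivation, assuming spaCy for noun-phrase chunking (the paper does not name its extractor):

```python
# Sketch of deriving the prompt from a question-style topic by taking
# its first noun phrase. spaCy is assumed here for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")

def topic_to_prompt(topic: str) -> str:
    doc = nlp(topic)
    chunks = list(doc.noun_chunks)
    return chunks[0].text if chunks else topic

# topic_to_prompt("is abortion wrong?") -> "abortion"
```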
Approach        Abortion  Death pen.  Gay marr.  Drug leg.  Glob. warm.  Env. prot.  Med. marij.  Smok. ban  Min. wage  Border fence  All 48
Ground-truth    0.49      0.59        0.55       0.55       0.55         0.55        0.50         0.53       0.48       0.62          0.52
S2S-baseline    0.49      0.48
S2S-model
LM-baseline
LM-conditioned                                                                                                                        0.54

Table 3: Accuracy of each classifier trained on claims generated by the evaluated approaches to predict the stance, on the 10 most frequent big issues as well as on average over all 48 big issues. Values marked with * are significantly better than the corresponding baseline at p < .05 according to a one-tailed Student's t-test.

Baselines
To evaluate the gain of encoding a user's beliefs, we compare our two approaches to the corresponding versions without stances on big issues as input. We refer to these baselines as S2S-baseline and LM-baseline, respectively. Note that a baseline that uses the bag-of-words of the targeted topic itself to guide the generation would not be valid, since we have no information on the user's stance on this targeted topic.

Table 2 shows the results of our approaches and the baselines in terms of BLEU and METEOR. For
the S2S-model, the BLEU scores of our approach are significantly better than those of the baseline. The LM-conditioned is significantly better than its baseline version in terms of BLEU-1 and METEOR. In general, the S2S-model has the highest scores across all measures. The reason may be that it was trained in a supervised manner on the given dataset, whereas the LM-conditioned was only fine-tuned in an unsupervised way on a different argument corpus.

Regarding the encoding of user stances, Table 3 shows the accuracy of a linear classifier trained to predict the stance from the claims generated by each approach as well as from the ground truth, on average and on the 10 most frequent big issues. A complete table with all big issues can be found in the appendix.

The best average accuracy across all big issues is achieved by the LM-conditioned (0.54). Compared to the corresponding baselines, the LM-conditioned and the S2S-model generated claims that boosted the accuracy of the stance classifier on 33 (69%) and 21 (44%) of all big issues, respectively. Overall, for 20 of the big issues, the best accuracy was achieved on the claims generated by the conditioned LM, compared to only nine big issues for the S2S-model. This indicates that the LM-conditioned can better encode a user's beliefs, modeled as stances on big issues, into the generated claims.
Manual Evaluation

To obtain more insights into belief-based claim generation, we let human annotators manually evaluate the output of the given approaches. Upon inspecting a sample of claims generated by our approaches, we noticed that the LM-conditioned produces more fluent and informative texts. Accordingly, we focused on the LM-conditioned and its baseline in the evaluation, for which we conducted two user studies. The goal of the first was to assess the quality of the automatically collected big-issue bags-of-words, while the second targeted the output of the LM-conditioned, its baseline, and a variant that utilizes a manually refined bag-of-words.
To keep the manual annotation effort manageable, we evaluated only the top-10 big issues. Two authors of this paper categorized each word in the pro/con bags-of-words of the corresponding big issue into five categories, c1–c5:

c1: Word irrelevant to the big issue.
c2: Relevant word, wrong stance.
c3: Relevant word, both stances possible.
c4: Relevant word, correct stance.
c5: Very relevant word, correct stance.
                     Overall               Relatedness Level 4   Relatedness Level 3   Relatedness Level 2
Approach             True  False  Undec.   True  False  Undec.   True  False  Undec.   True  False  Undec.
LM-baseline          44%   34%             50%   50%             55%   31%    14%
LM-conditioned       37%   32%    31%      35%   38%    27%      59%   41%
LM-cond. (manual)    45%   26%             50%   31%             61%   28%    11%      25%   18%    56%
Ground Truth         42%   30%    28%      38%   42%    19%      64%   27%    9%       27%   19%    54%

Table 4: Manual evaluation: Percentage of cases for each approach where the majority of annotators predicted the stance of a generated claim on the given big issue correctly (True), incorrectly (False), or could not decide it (Undec.). The overall scores and those for each topic/big-issue relatedness level are listed.
         Irrelevant    Relevant              Very Relevant
Words    c1            c2     c3     c4      c5
Pro      14%           10%    36%    34%     6%
Con      36%           2%     34%    26%     2%

Table 5: Distribution of the pro/con bags-of-words, averaged across the top-10 big issues, over the five considered categories: c2 means wrong stance, c3 means words that fit both stances, and c4 and c5 represent the correct stance.
Examples can be found in the appendix. To compute inter-annotator agreement, three big issues were annotated by both annotators, resulting in a Cohen's κ of 0.45, reflecting moderate agreement. Afterwards, only one annotator continued the annotation of the other big issues.

Table 5 shows the distribution of the words over the categories, averaged across the 10 big issues. For the pro bags-of-words, around 40% of the words are relevant and reflect the right stance, while 36% are relevant but could be used in arguments of both stances. For the con bags-of-words, however, the percentages are lower (28% and 34%, respectively). A considerable proportion of words belongs to categories c1 and c2, which creates noise that could confuse the conditioning process of the LM. Hence, we also consider a variant of the conditioned LM that uses only the relevant words from c4 and c5.

We evaluate the effectiveness in terms of whether a given generated claim reveals the stance of the given user on a specific big issue, as well as how informative the claim is regarding the given topic. Since not all topics are directly related to the big issues that can be revealed in the generated claims, we manually annotated the relatedness of the 200 most frequent topics in the test dataset to the 10 most frequent big issues, and created the evaluation sample accordingly. In particular, two authors of this paper scored the relatedness of each pair of topic and big issue on a scale from 1 to 4:

4: Topic and big issue are the same. Example: "gay marriage should be legalized" and "gay marriage"
3: A stance on the topic likely affects the stance on the big issue. Example: "killing domestic abusers" and "death penalty"

2: A stance on the topic may affect the stance on the big issue. Example: "morality" and "abortion"

1: Topic and big issue are not related. Example: "do aliens exist?" and "abortion"
The two annotators had a Cohen's κ agreement of 0.54. Around 97.4% of all pairs got score 1, 1.1% score 2, 0.8% score 3, and 0.7% score 4. The small percentage of cases that can be evaluated reflects a limitation of the designed evaluation study. However, it still allows us to evaluate the effectiveness of our approach for different levels of relatedness. Given the annotated pairs, we randomly selected 10 pairs from each of the levels 2, 3, and 4. For each pair, we then collected all claims on the topic from the test set where the author specifies a stance on the corresponding big issue. We randomly selected 30 claims per level, resulting in an evaluation sample of 90 instances.

We used the crowdsourcing platform MTurk for the evaluation. For each instance, we showed a topic, a claim, and the corresponding big issue to three annotators. The annotators had to perform two tasks: (1) predict the stance of the user on the corresponding big issue from the text of the claim, and (2) rate the claim's informativeness regarding the topic on a scale from 1 to 3.

Table 4 shows the percentage of cases in which the majority of annotators predicted the stance correctly (true), incorrectly (false), or could not decide
about the stance (undec.) from the generated claim. Across the whole sample (Overall), the claims generated by
LM-conditioned (manual), the model conditioned on the refined bag-of-words, most often allowed predicting the stance correctly (45%). We thus attribute the low effectiveness of the LM-conditioned to the noise generated by the automatic collection of the big issues' bags-of-words, especially since the effectiveness gets better across all levels when this noise is eliminated.

Analyzing each relatedness level individually yields more insights. For relatedness level 4, where the topic is the same as the big issue, the LM-conditioned (manual) generated claims for which the majority of the cases with a known stance were correct (63%). In level 3, we observe that both versions of our approach outperform the baseline in producing claims that express the correct stance on the corresponding big issue, with percentages of 59% and 68%, respectively. Finally, at relatedness level 2, which represents a weak relation between topics and big issues, predicting the stance seems to become hard, as indicated by the high percentages of undecided cases. We believe that the weak relatedness made the annotators guess the stance in some cases, leading to unreliable annotations.

Table 6 shows the average scores of all approaches regarding the informativeness of the generated claims. Here, both versions of our approach achieved better scores than the baseline, matching the ground-truth score. We believe that the low scores of the ground-truth claims stem from the noise generated in the claim detection step.

Approach            Overall   Level 4   Level 3   Level 2
LM-baseline         1.8
LM-conditioned
LM-cond. (manual)   2.0       2.3       2.2       1.5
Ground Truth        2.0       1.9       1.8       2.2

Table 6: Manual evaluation: Mean informativeness of the claims generated by each approach with regard to the topic (1–3; higher is better). The overall scores and those for each topic/big-issue relatedness level are listed.
Error Analysis
Table 7 shows selected cases from our evaluation. In case (a), even though the model generated a claim focusing on fundamental rights, the majority of the annotators was undecided about the stance. Case (b) shows the effect of noise in the bag-of-words: the generated claim reflects a pro stance on abortion, while the ground-truth stance is con; this is avoided in the claim generated by LM-conditioned (manual). Case (c) shows a working example, in which our approach correctly generated a claim on whaling from an environmental perspective when conditioned as such. Case (d) is a level-2 example that indicates a limitation of our evaluation: the generated claim reveals a stance on abortion, but we asked about the death penalty.

(a) Topic: is abortion ok
Big issue: (Con) Abortion
LM-cond.: "abortion rights groups argue that the right to abortion is a fundamental human right." (Stance: Undec.)
LM-baseline: "abortionists are not the only ones who are against abortion. There are many other people who are against abortion." (Stance: Undec.)

(b) Topic: abortion is right or wrong
Big issue: (Con) Abortion
LM-cond.: "abortion right or wrong is a personal choice. I believe that abortion should be legal because it's the woman's body and she has every right to do whatever she wants." (Stance: Pro)
LM-cond. (manual): "abortion right or wrong. I believe that abortion is murder and should be illegal in all cases except when the mother's life is at stake, rape, incest etc." (Stance: Con)

(c) Topic: whaling
Big issue: (Pro) Environmental Protection
LM-cond.: "whaling is a very dangerous and cruel practice. It causes great harm to the environment; animals that are caught in it have their lives cut short by being eaten alive." (Stance: Pro)
LM-baseline: "whaling is a sport. It is a sport because it involves a lot of physical exertion and skill." (Stance: Undec.)

(d) Topic: should murder be legalized?
Big issue: (Pro) Death Penalty
LM-cond.: "should murder be legalized? I think that abortion should not be legal. Abortion is killing a baby, and it's wrong to kill an innocent human being." (Stance: Undec.)
LM-baseline: "should murder be legalized? I think so. I think that it should be legalized because it is a good thing." (Stance: Undec.)

Table 7: A selection of claims generated by the different evaluated approaches for the different association levels between topic and big issue, as discussed in the text.
Conclusion
In this paper, we have proposed to equip argument generation technology with the ability to encode beliefs, for two reasons: First, it reflects the human process of synthesizing arguments; second, it gives more control over the generated arguments, helping them better reach the audience. For this purpose, we have presented the task of belief-based claim generation. Concretely, we studied the research questions of how to model a user's beliefs as well as how to encode them when generating an argumentative text. We have modeled users' beliefs via their stances on big issues and used them as an extra input to our approaches.

Our automatic evaluation has provided evidence of the applicability of encoding beliefs into argumentative texts. In manual studies, we found that limitations in the effectiveness of our approach stem from the noise produced by the automatic collection of the bags-of-words. The findings of this paper lay the ground for investigating the role of beliefs in generating arguments that reach their audience.

We point out that ethical issues arise when tuning arguments to affect specific people, such as attempts to manipulate them. While the task and settings considered here are too fundamental to already make these issues critical, future work should pay attention to them. Our goal is to develop systems that bring people together.
Acknowledgments
We thank the anonymous reviewers for their helpful feedback. This work was partially supported by the German Research Foundation (DFG) within the Collaborative Research Center "On-The-Fly Computing" (SFB 901/3) under project number 160364472.
References
Yamen Ajjour, Henning Wachsmuth, Johannes Kiesel, Martin Potthast, Matthias Hagen, and Benno Stein. 2019. Data acquisition for argument search: The args.me corpus. In Proceedings of the 42nd Edition of the German Conference on Artificial Intelligence, pages 48–59.

Milad Alshomary, Shahbaz Syed, Martin Potthast, and Henning Wachsmuth. 2020. Target inference in argument conclusion generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4334–4345, Online. Association for Computational Linguistics.

Yonatan Bilu, Daniel Hershcovich, and Noam Slonim. 2015. Automatic claim negation: Why, how and when. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 84–93, Denver, CO. Association for Computational Linguistics.

Giuseppe Carenini and Johanna D. Moore. 2006. Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11):925–952.

Tuhin Chakrabarty, Christopher Hidey, and Kathleen McKeown. 2019. IMHO fine-tuning improves claim detection. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 558–563.

Wei-Fan Chen, Henning Wachsmuth, Khalid Al Khatib, and Benno Stein. 2018. Learning to flip the bias of news headlines. In Proceedings of the 11th International Conference on Natural Language Generation, pages 79–88. Association for Computational Linguistics.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Esin Durmus and Claire Cardie. 2018. Exploring the role of prior beliefs for argument persuasion. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1035–1045, New Orleans, Louisiana. Association for Computational Linguistics.

Frans H. van Eemeren and Peter Houtlosser. 1999. Strategic manoeuvring in argumentative discourse. Discourse Studies, 1(4):479–497.

Roxanne El Baff, Henning Wachsmuth, Khalid Al Khatib, Manfred Stede, and Benno Stein. 2019. Computational argumentation synthesis as a language modeling task. In Proceedings of the 12th International Conference on Natural Language Generation, pages 54–64, Tokyo, Japan. Association for Computational Linguistics.

Matthew Feinberg and Robb Willer. 2015. From gulf to bridge: When do moral arguments facilitate political influence? Personality and Social Psychology Bulletin, 41(12):1665–1681.

David M. Godden. 2010. The importance of belief in argumentation: Belief, commitment and the effective resolution of a difference of opinion. Synthese, 172(3):397–414.

Floriana Grasso, Alison Cawsey, and Ray Jones. 2000. Dialectical argumentation to solve conflicts in advice giving: A case study in the promotion of healthy nutrition. International Journal of Human-Computer Studies, 53(6):1077–1115.

Christopher Hidey and Kathleen McKeown. 2019. Fixed that for you: Generating contrastive claims with semantic edits. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1756–1767.

Xinyu Hua, Zhe Hu, and Lu Wang. 2019. Argument generation with retrieval, planning, and realization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2661–2672.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.

Chin-Yew Lin and Eduard Hovy. 2000. The automated acquisition of topic signatures for text summarization. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 583–593, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Misa Sato, Kohsuke Yanai, Toshinori Miyoshi, Toshihiko Yanase, Makoto Iwayama, Qinghua Sun, and Yoshiki Niwa. 2015. End-to-end argument generation system in debating. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 109–114.

Benjamin Schiller, Johannes Daxenberger, and Iryna Gurevych. 2020. Aspect-controlled neural argument generation. arXiv preprint arXiv:2005.00084.

Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. 2019. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203.

M. Stede, J. Schneider, and G. Hirst. 2018. Argumentation Mining.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Henning Wachsmuth, Manfred Stede, Roxanne El Baff, Khalid Al Khatib, Maria Skeppstedt, and Benno Stein. 2018. Argumentation synthesis following rhetorical strategies. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3753–3765. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

Ingrid Zukerman, Richard McConachy, and Sarah George. 2000. Using argumentation strategies in automated argument generation. In