LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content

Shreya Gupta†*, Parantak Singh‡*, Megha Sundriyal†, Md Shad Akhtar†, Tanmoy Chakraborty†
† IIIT-Delhi, India. ‡ Birla Institute of Technology and Science, Pilani, Goa, India.
{shreyag, meghas, shad.akhtar, tanmoy}@iiitd.ac.in, [email protected]
* The first two authors have equal contributions. The work was done when Parantak was an intern at LCS2 Lab, IIIT-Delhi.

Abstract
The conceptualization of a claim lies at the core of argument mining. The segregation of claims is complex, owing to the divergence in textual syntax and context across different distributions. Another pressing issue is the unavailability of labeled unstructured text for experimentation. In this paper, we propose LESA, a framework which aims at advancing headfirst into expunging the former issue by assembling a source-independent generalized model that captures syntactic features through part-of-speech and dependency embeddings, as well as contextual features through a fine-tuned language model. We resolve the latter issue by annotating a Twitter dataset which aims at providing a testing ground on a large unstructured dataset. Experimental results show that LESA improves upon the state-of-the-art performance across six benchmark claim datasets by an average of 3 claim-F1 points for in-domain experiments and by 2 claim-F1 points for general-domain experiments. On our dataset too, LESA outperforms existing baselines by 1 claim-F1 point on the in-domain experiments and 2 claim-F1 points on the general-domain experiments. We also release comprehensive data annotation guidelines compiled during the annotation phase (which were missing in the current literature).
Introduction

The concept of a claim lies at the core of the argument mining task. Toulmin (2003), in his argumentation theory, described the term 'claim' as 'an assertion that deserves our attention'; albeit not very precise, it still serves as an initial insight. In recent years, Govier (2013) described a 'claim' as 'a disputed statement that we try to support with reasons.'

The predicament behind the claim detection task exists given the disparity in conceptualization and the lack of a proper definition of a claim. The task of claim detection across different domains has garnered tremendous attention so far, owing to an uprise in social media consumption and, by extension, the existence of fake news, online debates, widely-read blogs, etc. As an elementary example, claim detection can be used as a precursor to fact-checking, wherein segregation of claims aids in restricting the corpus that needs a fact-check. A few examples are shown in Table 1.

Most of the existing works are built upon two fundamental pillars – semantic encapsulation (Daxenberger et al., 2017; Chakrabarty et al., 2019) and syntactic encapsulation (Levy et al., 2014; Lippi and Torroni, 2015). They mainly focus on adapting to texts from similar distributions or topics or both. Secondly, they often exercise against well-structured and laboriously pre-processed formal texts, owing to the lack of a labeled corpus consisting of unstructured texts. As a result, claim detection from unstructured raw data still lies under a relatively less explored umbrella.
Text                                                                      Claim?
Alcohol cures corona.                                                     Yes
Wearing mask can prevent corona.                                          Yes
Lord, please protect my family & the Philippines from the corona virus.  No
If this corona scare doesn't end soon imma have to intervene              No

Table 1: A few examples of claims and non-claims.
Motivation: Claims can be sourced from a variety of sources, e.g., online social media texts, microblogs, Wikipedia articles, etc. It is, however, crucial to pay special attention to claims observed on online social media (OSM) sites (Baum et al., 2020; WHO, 2020). Twitter, being a major OSM platform, provides the perfect playground for different ideologies and perspectives. Over time, Twitter has emerged as the hub for short, unstructured pieces of text that describe anything from news to personal life. Most individuals view and believe things that align with their compass and prior knowledge, aka conformity bias (Whalen and Laland, 2015) – users tend to make bold claims that usually create a clash between users of varied opinions. At times, these claims incite a negative impact on individuals and society. As an example, a tweet that reads "alcohol cures corona" can lead to massive retweeting and consequential unrest, especially in times of a pandemic, when people are more vulnerable to suggestions. In such cases, automated promotion of claims for immediate further checks could prove to be of utmost importance. An automated system is pivotal since OSM data is far too voluminous to allow for manual checks, even by an expert.

At the same time, deploying separate systems contingent on the source of a text is inefficient and moves away from the goal of attaining human intelligence in natural language processing tasks. An ideal situation would be a framework that can effectively detect claims in the general setting. However, a major bottleneck towards this goal is the unavailability of an annotated dataset from noisy platforms like Twitter. We acknowledge this bottleneck and, in addition to proposing a generalised framework, we develop a qualitative annotated resource and guidelines for claim detection in tweets.
Proposed Method: There exist several claim detection models; however, the downside is that most of them are trained on structured text from a specific domain. Therefore, in this work, we propose LESA, a Linguistic Encapsulation and Semantic Amalgamation based generalized claim detection model that is capable of accounting for different text distributions simultaneously. To formalize this, we divide the text, contingent upon its structure, into three broad categories – noisy text (tweets), semi-noisy text (comments), and non-noisy text (news, essays, etc.). We model each category separately in a joint framework and fuse them together using attention layers.

Since the task of claim detection has a strong association with the structure of the input, as argued by Lippi and Torroni (2015), we leverage two linguistic properties – part-of-speech (POS) tags and the dependency tree – to capture the linguistic variations of each category. Subsequently, we amalgamate these features with BERT (Devlin et al., 2019) for classification.

We evaluate LESA on seven different datasets (including our Twitter dataset) and observe efficient performance in each case. Moreover, we compare LESA's performance against various state-of-the-art systems for all seven datasets in the general and individual settings. The comparative study advocates the superior performance of LESA.
Summary of the Contributions: We summarize our major contributions below:

• Twitter claim detection dataset and comprehensive annotation guidelines. To mitigate the unavailability of an annotated dataset for claim detection in Twitter, we develop a large COVID-19 Twitter dataset, the first of its kind, with ∼10,000 labeled tweets, following a comprehensive set of claim annotation guidelines.

• LESA, a generalized claim detection system. We propose a generalized claim detection model, LESA, that identifies the presence of claims in any online text, without prior knowledge of the source and independent of the domain. To the best of our knowledge, this is the first attempt to define a model that handles claim detection from both structured and unstructured data in conjunction.

• Exhaustive evaluation and superior results. We evaluate LESA against multiple state-of-the-art models on six benchmark claim detection datasets and our own Twitter dataset. The comparison suggests LESA's superior performance across datasets and the significance of each model component.
Reproducibility:
Code and dataset are publicly available at https://github.com/LCS2-IIITD/LESA-EACL-2021. The appendix comprises a detailed dataset description, annotation guidelines, hyper-parameters, and additional results.
Related Work

In the past decade, the task of claim detection has become a popular research area in text processing, with an initial pioneering attempt by Rosenthal and McKeown (2012). They worked on mining claims from discussion forums and employed a supervised approach with features based on sentiment and word-grams. Levy et al. (2014) proposed a context dependent claim detection (CDCD) approach. They described a CDC as 'a general, concise statement that directly supports or contests the given topic.' Their approach was evaluated over Wikipedia articles; it detected sentences that include CDCs using context-based and context-free features. This was followed by ranking and detecting CDCs using logistic regression. Lippi and Torroni (2015) proposed context-independent claim detection (CICD) using linguistic reasoning, and encapsulated structural information to detect claims. They used constituency parse trees to extract structural information and predicted parts of the sentence holding a claim using an SVM. Although their approach achieved promising results, they also used a Wikipedia dataset which was highly engineered and domain dependent.

Daxenberger et al. (2017) used six disparate datasets and contrasted the performance of several supervised models. They performed two sets of experiments – in-domain CD (trained and tested on the same dataset) and cross-domain CD (trained on one and tested on another unseen dataset). They learned divergent conceptualisations of claims over cross-domain datasets. Levy et al. (2017) proposed the first unsupervised approach for claim detection. They hypothesised a "claim sentence query" as an ordered triplet: ⟨that → MC → CL⟩. According to the authors, a claim begins with the word 'that', is followed by the main concept (MC) or topic name, which is further followed by words from a pre-defined claim lexicon (CL). This approach would not fit well for text stemming from social media platforms, owing to a lack of structure and the use of 'that' as an offset for a claim.

In recent years, transformer-based language models have been employed for claim detection. Chakrabarty et al. (2019) used over 5 million self-labeled Reddit comments that contained the abbreviations IMO (In My Opinion) or IMHO (In My Honest Opinion) to fine-tune their model. However, they made no attempt to explicitly encapsulate the structure of a sentence.

Recently, the CLEF-2020 shared task (Barrón-Cedeño et al., 2020) attracted multiple models which are tweaked specifically for claim detection. Williams et al. (2020) bagged the first position in the task using a fine-tuned RoBERTa (Liu et al., 2019) model with mean pooling and dropout. The first runner-up of the challenge, Nikolov et al. (2020), used logistic regression on various meta-data tweet features and a RoBERTa-based prediction. Cheema et al. (2020), the second runner-up, incorporated pre-trained BERT embeddings along with POS and dependency tags as features trained using an SVM.

Traditional approaches focused primarily on the syntactic representations of claims and textual feature generation, while recent neural methods leverage transformer models. With LESA, we attempt to learn from the past while building for the future – we propose encapsulating syntactic representations in the form of POS tags and dependency sequences along with the semantics of the input text using transformer-based BERT (Devlin et al., 2019). Another key observation has been the use of highly structured and domain-engineered datasets for training the existing models in claim detection. In the current age of alarming disinformation, we recognise the augmented need for a system that can detect claims in online text independent of its origin, context or domain. Therefore, in addition to considering texts from different online mediums, we incorporate, for the first time, a self-annotated large Twitter dataset to the relatively structured datasets that exist in this field.
Proposed Methodology

Traditionally, the narrative on claim detection is built around either syntactic (Levy et al., 2017; Lippi and Torroni, 2015) or semantic (Daxenberger et al., 2017; Chakrabarty et al., 2019) properties of the text. However, given our purview on the integration of both, we propose a combined model, LESA, that incorporates exclusively linguistic features leveraged from part-of-speech (POS) tags and dependency trees (DEP), as well as semantic features leveraged from a transformer-based model, BERT (Devlin et al., 2019).

By virtue of digital media, we generally deal with texts from three kinds of environments: (a) a controlled platform where content is pre-reviewed (e.g., news, essays, etc.); (b) a free platform where authors have the freedom to express themselves without any restrictions on the length (e.g., online comments, Wikipedia talk pages); and (c) a free platform with restrictions on the text length (e.g., tweets). The texts in the first category are usually free of any grammatical and typographical mistakes, and thus belong to the non-noisy category. On the other hand, in the third case, texts exhibit
a significant amount of noise, in terms of spelling variations, hashtags, emojis, emoticons, abbreviations, etc., used to express the desired information within the permissible limit; thus it belongs to the noisy class. The second case is a mixture of the two extreme cases and hence constitutes the semi-noisy category. We employ three pre-trained models representing noisy, semi-noisy, and non-noisy data for both POS and dependency-based features. The intuition is to leverage the structure-specific linguistic features in a joint framework.

Domain adaptation from a structured environment to an unstructured one is non-trivial and requires specific processing. Therefore, to ensure generalization, we choose to process each input text from three different viewpoints (structure-based segregation), and intelligently select the contributing features among them through an attention mechanism. We use it to extract the POS and DEP-based linguistic features. Subsequently, we fuse the linguistic and semantic features using another attention layer before feeding them to a multilayer perceptron (MLP) based classifier. The idea is to amalgamate a diverse set of features from different perspectives and leverage them for the final classification. A high-level architectural diagram is depicted in Figure 1. We design parallel pillars for each viewpoint (right side of Figure 1) such that the noisy pillar contains pre-trained information from the noisy source, and so on. When the common data is passed through the three pillars, we hypothesize that each pillar's contribution depends on the type of input data. For example, if the data source is a noisy platform, we hypothesize that the noisy pillars will have more significance than the other two viewpoints. We demonstrate this effect in Table 5.

Figure 1: Schematic diagram of our proposed LESA model. The structure on the right is a high-level schematic diagram; the structure on the left shows the POS and DEP modules for one viewpoint. (The figure depicts tri-gram POS sequences passing through an embedding layer, a BiLSTM, and an attention layer with an auxiliary softmax output; tri-gram dependency sequences passing through an embedding layer with positional and parent encodings, a transformer encoder, and global average pooling with an auxiliary softmax output; and an attention layer fusing the BERT, POS (noisy, semi-noisy, non-noisy), and DEP (noisy, semi-noisy, non-noisy) branches for the primary claim/non-claim softmax output.)
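To make the viewpoint selection concrete, the sketch below shows one way the attention over the three pillars and the final POS/DEP/BERT fusion could be implemented. It is a minimal PyTorch illustration under our own naming (ViewpointAttention, pillar_feats); the paper does not prescribe this exact formulation, and the 32-dimensional feature size is taken from the experimental setup described later.

# A minimal sketch (not the authors' released code) of attention-based
# fusion over the three viewpoint "pillars" (noisy, semi-noisy, non-noisy).
import torch
import torch.nn as nn

class ViewpointAttention(nn.Module):
    """Soft-selects among per-viewpoint feature vectors of equal size."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar score per viewpoint

    def forward(self, pillar_feats: torch.Tensor) -> torch.Tensor:
        # pillar_feats: (batch, n_views, dim)
        weights = torch.softmax(self.score(pillar_feats), dim=1)  # (batch, n_views, 1)
        return (weights * pillar_feats).sum(dim=1)                # (batch, dim)

# Toy usage: fuse POS features from the three pillars, then fuse the
# POS, DEP, and BERT branches with a second attention layer.
batch, dim = 4, 32
pos_pillars = torch.randn(batch, 3, dim)   # noisy / semi-noisy / non-noisy POS
dep_pillars = torch.randn(batch, 3, dim)
bert_feat = torch.randn(batch, dim)        # projected BERT representation

pos_att, dep_att = ViewpointAttention(dim), ViewpointAttention(dim)
fuse_att = ViewpointAttention(dim)
pos_feat, dep_feat = pos_att(pos_pillars), dep_att(dep_pillars)
fused = fuse_att(torch.stack([pos_feat, dep_feat, bert_feat], dim=1))
logits = nn.Linear(dim, 2)(fused)          # claim vs. non-claim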
A. Part-of-speech (POS) Module
The POS module consists of an embedding layer followed by a BiLSTM and an attention layer to extract the syntactic formation of the input text. We pre-train the POS module for each viewpoint, and later fine-tune them while training the integrated model.

At first, each sequence of tokens {x_1, x_2, ..., x_n} is converted to a sequence of corresponding POS tags, resulting in the set {p_1, p_2, ..., p_n}. However, the foremost limitation of this modeling strategy is the limited and small vocabulary size, owing to the specific number of POS tags. To tackle this, we resort to using k-grams of the sequence. The sequence of POS tags (with k = 3) now becomes {(p_0, p_1, p_2), (p_1, p_2, p_3), (p_2, p_3, p_4), ..., (p_{n-2}, p_{n-1}, p_n), (p_{n-1}, p_n, p_{n+1})}, where p_0 and p_{n+1} are dummy tags. Subsequently, a skip-gram model (Mikolov et al., 2013) is trained on the POS-transformed corpus of each dataset, which thereby translates to a POS embedding, E_P.
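As an illustration of the steps just described, the sketch below tags a toy corpus with spaCy, converts each tag sequence into overlapping tri-grams with dummy boundary tags, and trains a gensim skip-gram model using the window and dimension values reported in our experimental setup. The helper name pos_trigrams and the <S>/</S> boundary symbols are our own choices, not specified by the paper.

# A minimal sketch of the POS tri-gram embedding step.
# Requires: pip install spacy gensim; python -m spacy download en_core_web_sm
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

def pos_trigrams(text: str) -> list[str]:
    tags = [tok.pos_ for tok in nlp(text)]
    padded = ["<S>"] + tags + ["</S>"]          # dummy tags p_0 and p_{n+1}
    return ["|".join(padded[i:i + 3]) for i in range(len(padded) - 2)]

corpus = ["Alcohol cures corona.",
          "Wearing mask can prevent corona."]
sentences = [pos_trigrams(t) for t in corpus]

# Skip-gram (sg=1) over tri-gram tokens; min_count filters rare tri-grams.
pos_emb = Word2Vec(sentences, vector_size=20, window=6, sg=1, min_count=1)
print(pos_emb.wv[sentences[0][0]].shape)        # (20,) -> one row of E_P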
B. Dependency Tree (DEP) Module

Dependency parsing is the function of abstracting the grammatical assembly of a sequence of tokens {x_1, x_2, ..., x_n} such that there exists a directed relation (dependency), d(x_i, x_j), between any two tokens x_i and x_j, where x_i is the headword and x_j is modified by the headword. We obtain these dependency relations through spaCy, which uses the ClearNLP guidelines. Initially, each sequence is rendered into a combination of the dependency-tag arrangement {d_1, d_2, ..., d_n} and a parent-position arrangement {pp_1, pp_2, ..., pp_n}. Here, each d_j represents a dependency tag, where x_j is modified by x_i, and pp_j is the index of the modifier (headword) x_i.

We then leverage the transformer encoder (Vaswani et al., 2017), where, traditionally, a position-based signal is added to each token's embedding to help encode the placement of tokens. In our modified version, the token sequence is the dependency-tag sequence d_e = {d_1, d_2, ..., d_n}, wherein a parent-position based signal is additionally added to encode the position of the modifier words:

d'_e = d_e + [(E_{p_1}, E_{pp_1}), ..., (E_{p_n}, E_{pp_n})]    (1)

where d'_e ∈ R^{d×n} is the modified dependency embedding of a sequence of length n, E_{p_i} and E_{pp_i} are the encodings for the token position and the parent position (position of the token's modifier), and (·, ·) represents tuple brackets.

This helps us create a flat representation for a dependency graph. The transformer architecture that we employ comprises 5 attention heads with an embedding size of 20. Given that there are only a handful of dependency relations, this still poses the problem of a limited vocabulary size of 37. Having accounted for the parent positions already, we decide to again employ tri-gram sequences {(d_0, d_1, d_2), (d_1, d_2, d_3), (d_2, d_3, d_4), ..., (d_{n-2}, d_{n-1}, d_n), (d_{n-1}, d_n, d_{n+1})} in place of uni-grams.
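A minimal sketch of Eq. (1) follows: each dependency-tag embedding is summed with a token-position and a parent-position signal before a standard transformer encoder with 5 heads and dimension 20, matching the values above. The use of learned (rather than fixed) position encodings and all variable names are our assumptions.

# A minimal sketch of Eq. (1): each dependency-tag embedding receives both
# a token-position and a parent-position signal before the transformer encoder.
import torch
import torch.nn as nn

n_tags, dim, max_len = 37, 20, 128
tag_emb = nn.Embedding(n_tags, dim)        # d_e: dependency-tag embedding
pos_enc = nn.Embedding(max_len, dim)       # E_p: token-position encoding
par_enc = nn.Embedding(max_len, dim)       # E_pp: parent-position encoding

# Toy spaCy-style parse: tag ids and each token's head (parent) index.
dep_ids = torch.tensor([[5, 12, 3, 7]])    # (batch=1, n=4) dependency tags
parents = torch.tensor([[1, 1, 3, 1]])     # index of each token's headword
positions = torch.arange(4).unsqueeze(0)   # token positions 0..n-1

d_e = tag_emb(dep_ids)                                   # (1, 4, 20)
d_e_prime = d_e + pos_enc(positions) + par_enc(parents)  # Eq. (1)

encoder_layer = nn.TransformerEncoderLayer(
    d_model=dim, nhead=5, batch_first=True)              # 5 heads, dim 20
encoded = nn.TransformerEncoder(encoder_layer, num_layers=1)(d_e_prime)
print(encoded.shape)                                     # (1, 4, 20)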
Dataset

A number of datasets exist for the task of claim detection in online text (Peldszus and Stede, 2015; Stab and Gurevych, 2017); however, most of them contain formal and structured texts. As we discussed earlier, OSM platforms are overwhelmed with various claim-ridden posts. Despite the abundance of tweets, the literature does not suggest any significant effort for detecting claims in Twitter; arguably, the prime reason is the lack of a large-scale dataset. Recently, a workshop on claim detection and verification in Twitter was organized under CLEF-2020 (Barrón-Cedeño et al., 2020). It had two subtasks related to claim identification with separate datasets: the first dataset consists of COVID-19 tweets for claim detection, whereas the second one comprises further tweets for claim retrieval. In total, only a limited number of annotated tweets were available, of which only a subset contained claims. Another recent in-progress dataset on claim detection, which currently contains only a small number of claim and non-claim tweets, was released by Alam et al. (2020).

Unfortunately, the aforementioned limited instances are insufficient to develop an efficient model. Therefore, we attempt to develop a new and relatively larger dataset for claim detection on OSM platforms. We collected ∼10,000 tweets from various sources (Carlson, 2020; Smith, 2020; Celin, 2020; Chen et al., 2020; Qazi et al., 2020) and manually annotated them. We additionally included the claim detection datasets of Alam et al. (2020) and CLEF-2020 (Barrón-Cedeño et al., 2020) and re-annotated them in accordance with our guidelines. During the cleaning process, we filtered a majority of tweets due to their irrelevancy and duplicacy. To ensure removal of duplicates, we performed manual checking and exhaustive preprocessing.

                 Our Annotation
CLEF-2020    Non-claim   Claim
Non-claim       301        47
Claim            64       550

Table 2: Confusion matrix highlighting the differences and similarities between Alam et al. (2020) and our annotation guidelines for the CLEF-2020 claim dataset.
Data Annotation: To annotate the tweets, we extend and adapt the claim annotation guidelines of Alam et al. (2020). The authors targeted and annotated only a subset of claims, i.e., factually-verifiable claims. They did not consider personal opinions, sarcastic comments, implicit claims, or claims existing at the sub-sentence or sub-clause level. Subsequently, we propose our own definition of claims and extrapolate the existing guidelines to be more inclusive, nuanced, and applicable to a diverse set of claims. Our official definition of a claim, adopted from the Oxford dictionary, is to state or assert that something is the case, with or without providing evidence or proof.

We present the details of the annotation guidelines in Gupta et al. (2021). Following the guidelines, we annotated the collected tweets, and to ensure coherence and conformity, we re-annotated the tweets of Alam et al. (2020) and CLEF-2020 (Barrón-Cedeño et al., 2020).

Dataset      Text
Noisy
TWR          @realDonaldTrump Does ingesting bleach and shining a bright light in the rectal area really cure
Semi-noisy
OC           *smacks blonde wig on Axel* I think as far as DiZ is concerned, he is very smart but also in certain areas very dumb - - witness the fact that he didn't notice his apprentices were going to turn on him, when some of them (cough Vexen cough) aren't exactly subtle by nature.
WTP          Not to mention one without any anonymous users TALKING IN CAPITAL LETTERS !!!!!!!!
Non-noisy
MT           Tax data that are not made available for free should not be acquired by the state.
PE           I believe that education is the single most important factor in the development of a country.
VG           When's the last time you slipped on the concept of truth?
WD           The public schools are a bad place to send a kid for a good education anymore.

Table 3: One example from each dataset. Underlined text highlights noisy and semi-noisy phrases.
Dataset      Noisy    Semi-noisy       Non-noisy
             TWR      OC      WTP      MT     PE      VG     WD
Tr   Cl      7354     623     1030     100    1885    495    190
     N-cl    1055     7387    7174     301    4499    2012   3332
Ts   Cl      1296     64      105      12     223     57     14
     N-cl    189      730     759      36     509     221    221
Tot  Cl      8650     687     1135     112    2108    552    204
     N-cl    1244     8117    7933     337    5008    2233   3553

Table 4: Statistics of the datasets (Abbreviations – Cl: Claim, N-cl: Non-claim, Tr: Train set, Ts: Test set, Tot: Total).
It is intriguing to see the differences and similarities between the two sets of guidelines; therefore, we compile a confusion matrix for the CLEF-2020 claim dataset, as presented in Table 2. Each tweet in our corpus of ∼10,000 tweets has been annotated by at least two annotators, and inter-annotator agreement was measured with the average Cohen's kappa score (Cohen, 1960). In case of a disagreement, a third annotator was considered and a majority vote was used for the final label. All annotators were linguists.
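For concreteness, the sketch below shows how the agreement and adjudication procedure described above can be computed with scikit-learn: pairwise Cohen's kappa between two annotators, and a majority vote over three annotators to resolve disagreements. The label lists are toy data, not our corpus.

# A minimal sketch of the agreement computation: pairwise Cohen's kappa
# between two annotators, and a majority vote with a third annotator
# used to break disagreements.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

ann1 = ["claim", "claim", "non-claim", "claim", "non-claim"]
ann2 = ["claim", "non-claim", "non-claim", "claim", "non-claim"]
ann3 = ["claim", "claim", "non-claim", "non-claim", "non-claim"]

print(cohen_kappa_score(ann1, ann2))  # inter-annotator agreement

final = [Counter(votes).most_common(1)[0][0]
         for votes in zip(ann1, ann2, ann3)]  # majority label per tweet
print(final)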
Other Datasets: Since we attempt to create a generalized model that is able to detect the presence of a claim in any online text, we accumulate, in addition to the Twitter dataset, six publicly available benchmark datasets: (i) Online Comments (OC) containing blog threads of LiveJournal (Biran and Rambow, 2011); (ii) Wiki Talk Pages (WTP) (Biran and Rambow, 2011); (iii) German Microtext (MT) (Peldszus and Stede, 2015); (iv) Persuasive Student Essays (PE) (Stab and Gurevych, 2017); (v) Various Genres (VG) containing newspaper editorials, parliamentary records and judicial summaries; and (vi) Web Discourse (WD) containing blog posts or user comments (Habernal and Gurevych, 2015). All datasets utilised in this paper contain English texts only. For German Microtexts (MT), we used the publicly available English-translated version published by MT's original authors (Peldszus and Stede, 2015). The same was utilized by Chakrabarty et al. (2019).

The datasets are formed by considering text at the sentence level. For example, in the Persuasive Essays (PE) dataset, each essay is broken into sentences and each sentence is individually annotated for a claim. Considering the structure of the input texts in these datasets, we group them into three categories as follows: Noisy (Twitter), Semi-noisy (OC, WTP), and Non-noisy (MT, PE, VG, WD). We list one example from each dataset in Table 3. We also highlight the noisy and semi-noisy phrases in the Twitter, OC, and WTP datasets respectively. Moreover, we present detailed statistics of all seven datasets in Table 4.
Experiments

For all datasets besides Twitter, we use the train, validation, and test splits as provided by UKP Lab (https://tinyurl.com/yyckv29p). A mutually exclusive 70:15:15 split was maintained for the Twitter dataset. We compute POS embeddings by learning a word2vec skip-gram model (Mikolov et al., 2013) on the tri-gram POS sequences. For the skip-gram model, we set context window = 6 and embedding dimension = 20, and discard POS sequences whose frequency falls below a threshold. The choice of n = 3 is empirical; we report supporting experimental results in Gupta et al. (2021). Subsequently, we compute dependency embeddings with dimension = 20 using a Transformer (Vaswani et al., 2017) encoder with 5 attention heads. Please note that the choice of using a BiLSTM, as opposed to a Transformer, for extracting the POS features is empirical.

The outputs of the POS and dependency embedding layers are subsequently fed to BiLSTM and GlobalAveragePooling layers, respectively. Their respective outputs are projected to a 32-dimensional representation for the fusion. We employ HuggingFace's BERT implementation for computing the tweet representation. The 768-dimensional embedding is projected to a 32-dimensional representation using linear layers. We progress with the 32-dimensional representation of BERT as we observe no elevation in results on using the 768-dimensional representation, as can be seen in Table 5. Besides, the latter results in many more trainable parameters than the former. We employ sparse categorical cross-entropy loss with the Adam optimizer and use softmax for the final classification. For evaluation, we adopt the macro-F1 (m-F1) and claim-F1 (c-F1) scores used by the existing methods (Daxenberger et al., 2017; Chakrabarty et al., 2019). Other hyperparameters are reported in Gupta et al. (2021).

We perform our experiments in two setups. In the first, in-domain setup, we train, validate and test on the same dataset and repeat this for all seven datasets independently. In the second, general-domain setup, we combine all datasets and train a unified generic model. Subsequently, we evaluate the trained model on all seven datasets individually. Furthermore, for each experiment, we ensure a balanced training set by down-sampling the dominant class. However, we use the original test set for a fair comparison against the existing baselines and state-of-the-art models.

Models                 Twitter      OC           WTP          MT           PE           VG           WD           Wt Avg
                       m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1
BERT                   0.60 0.83   0.52 –      –    –      –    –      –    –      –    –      –    –      0.58 0.73
LESA (Combined-view)   0.61 0.85   0.51 0.23   0.53 0.31   0.77 0.71   0.71 0.64   0.57 0.40   0.48 0.22   0.59 –
LESA (768 dim)         0.58 0.80   0.52 –      –    –      –    –      –    –      –    –      –    –      –    –
LESA (32 dim)          –    –      –    –      –    –      –    –      –    –      –    –      –    –      0.61 0.75

Table 5: Macro-F1 (m-F1) and Claim-F1 (c-F1) for the ablation studies.

Table 5 shows m-F1 and c-F1 for different variants of LESA. We begin with a fine-tuned BERT model and observe the performance on the test sets of all seven datasets. On the Twitter dataset, the BERT architecture yields an m-F1 score of 0.60 and a c-F1 score of 0.83. We also report the weighted-average scores of 0.58 m-F1 and 0.73 c-F1 in the last two columns of Table 5. Since we hypothesize that claim detection has a strong association with the structure of the text, we amalgamate POS and dependency (DEP) information with the BERT architecture in a step-wise manner. The BERT+POS model reports an increase of 1% in m-F1 and c-F1 scores on the Twitter dataset. We observe similar trends on the other datasets and in the overall weighted-average score as well. We also perform experiments on other permutations, and their results are listed in Table 5. Finally, we combine both the POS and DEP modules with the BERT architecture (aka LESA). It obtains improved results in most of the cases, as shown in the last row of Table 5. The best result on average stands at 0.61 m-F1 and 0.75 c-F1 for the proposed LESA model. This serves as a testament to our hypothesis, validating our assumption that combining syntactic and semantic representations leads to better detection of claims.

In all aforementioned experiments, we use our pre-defined concept of three viewpoints, i.e., noisy, semi-noisy and non-noisy. Therefore, for completeness, we also construct a combined viewpoint which does not contain any structure-specific pillar in the POS or DEP branches. The results from this ablation experiment are reported in the LESA (Combined-view) row. We observe that the combined-view results are inferior to the variant with separate viewpoints for each component (cf. the second-last and last rows of Table 5, respectively). Thus, the benefit of providing attention to datasets based on the noise in their content is demonstrated by an increase of ∼2 m-F1 points (0.59 to 0.61) from the combined-viewpoint to the separate-viewpoints experiment.
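Both reported metrics reduce to standard F1 computations. A minimal scikit-learn sketch, assuming claims are encoded as label 1 and non-claims as label 0, is given below.

# A minimal sketch of the two reported metrics: macro-F1 over both classes
# (m-F1) and the F1 of the claim class alone (c-F1).
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]   # 1 = claim, 0 = non-claim
y_pred = [1, 0, 1, 0, 0, 1]

m_f1 = f1_score(y_true, y_pred, average="macro")
c_f1 = f1_score(y_true, y_pred, pos_label=1, average="binary")
print(f"m-F1 = {m_f1:.2f}, c-F1 = {c_f1:.2f}")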
A. Baselines and Comparative Analysis

We employ the following baselines (some of them being state-of-the-art systems for claim detection and text classification): ▷ XLNet (Yang et al., 2019): similar to the BERT model; we fine-tune XLNet for claim detection. ▷ Accenture (Williams et al., 2020): a RoBERTa-based system that ranked first in the CLEF-2020 claim detection task (Barrón-Cedeño et al., 2020). ▷ Team Alex (Nikolov et al., 2020): the second-ranked system at the CLEF-2020 task that fused tweet meta-data into RoBERTa for the final prediction. ▷ CheckSquare (Cheema et al., 2020): an SVM-based system designed on top of pre-trained BERT embeddings, in addition to incorporating POS and dependency tags as external features. ▷ CrossDomain (Daxenberger et al., 2017): among several variations reported in the paper, their best model incorporates a CNN (random initialization) for the detection.

We reproduce the top submissions from the CLEF-2020 challenge using the best performing models mentioned in the referenced papers. Code for CheckSquare was provided online. For Accenture and Team Alex, we reproduce their methods using the hyper-parameters mentioned in the papers. We evaluate all baselines using the same train and test sets as for LESA.

Models   Twitter      OC           WTP          MT           PE           VG           WD           Wt Avg
         m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1
BERT     0.50 0.67   0.50 0.24   0.36 0.27   0.75 0.69   0.73 –      –    –      –    –      –    –
LESA     –    –      –    –      –    –      –    –      –    –      –    –      –    –      –    –

Table 6: Macro-F1 (m-F1) and F1 for claims (c-F1) in the in-domain setup. For CrossDomain, the asterisk (*) indicates results taken from Daxenberger et al. (2017) and the dagger (†) represents the reproduced results.

Model    Twitter      OC           WTP          MT           PE           VG           WD           Wt Avg
         m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1   m-F1 c-F1
BERT     0.60 0.83   0.52 0.24   0.53 –      –    –      –    –      –    –      –    –      –    –
LESA     –    –      –    –      –    –      –    –      –    –      –    –      –    –      –    0.75

Table 7: Macro-F1 (m-F1) and Claim-F1 (c-F1) in the general-domain setup.

Models   Noisy        Semi-Noisy   Non-Noisy
         m-F1 c-F1   m-F1 c-F1   m-F1 c-F1
BERT     0.60 0.83   0.52 –      –    –
LESA     –    –      –    –      –    –

Table 8: Category-wise weighted-average F1 scores.

We report our comparative analysis for the in-domain setup in Table 6. We observe that LESA obtains the best c-F1 scores for six out of seven datasets. Additionally, it achieves a weighted-average c-F1 that improves over the best performing baseline. In terms of m-F1 values, our weighted average ranks second, next to CrossDomain. We reproduced the CrossDomain baseline using their GitHub code (UKPLab, 2017); if the reproduced values are considered, our model outperforms all other models in m-F1 as well.

Similarly, we compile the results for the general-domain setup in Table 7. In the non-noisy category, LESA obtains better m-F1 scores than three of the four state-of-the-art systems on the MT, PE, and VG test sets. On WD, we observe similar m-F1 and c-F1 scores for both the best baseline and LESA. On the datasets in the other categories, we observe comparable m-F1 scores; however, none of the baselines is consistent across all datasets – e.g., CrossDomain (Daxenberger et al., 2017) reports the best m-F1 scores on Twitter and OC, but yields (joint) fourth-best performance on WTP. Moreover, LESA yields the best m-F1 score across the seven datasets on average. On the other hand, we obtain the best c-F1 scores for five out of seven datasets, and LESA reports an overall c-F1 of 0.75 with a significant improvement over the baselines.

Using a paired t-test, LESA showed statistically significant improvement over BERT in m-F1 and c-F1 for the noisy dataset. Results were also significant for m-F1 and c-F1 on PE and for m-F1 on WD. The small sample size of some datasets, like MT and VG, does not allow a reliable calculation of test statistics.
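The significance analysis above is a standard paired t-test over matched score samples. A minimal SciPy sketch follows; the per-run scores are illustrative placeholders, not the paper's actual values.

# A minimal sketch of the paired t-test used for the significance claims:
# compare matched score samples (e.g., per-run c-F1) of LESA and BERT.
from scipy import stats

lesa_scores = [0.84, 0.85, 0.83, 0.86, 0.85]   # illustrative only
bert_scores = [0.82, 0.83, 0.82, 0.84, 0.83]   # illustrative only

t_stat, p_value = stats.ttest_rel(lesa_scores, bert_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p -> significant gain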
x1. (TWR) 28 corona outbreak cases thus far in india italian tourists 16 their driver 1 kerala 3 cured discharged agra 6 delhi 1 noida school dad telangana 1 coronavirusindia
x2. can we just call this a cure now
x3. Besides it should be in the interest of the health insurers to recognize alternative medicine as treatment, since there is a chance of recovery.
x4. On the other hand, fossil fuels are abundant and inexpensive in many areas
x5. Daily exercise will help also to develop children's brain function.
x6. Skinny Puppy is headlining Festival Kinetik!
x7. I guess I'm not desensitized enough to just forget about people being murdered in my neighborhood.
x8. No wonder 50 million babies have been aborted since 1973.

Table 9: Error analysis of the outputs. Each example is listed in the original table alongside its gold label and the predictions of LESA and CrossDomain; red text highlights errors.
Since our work intends to develop a model that is able to detect claims irrespective of the source and origin of text, we also analyse the weighted-average scores for each category in Table 8. We observe that LESA obtains the best c-F1 scores in each category, in addition to the best m-F1 score in the non-noisy category. For the other two categories, LESA yields comparable performance. The results are better for noisy data than for non-noisy owing to the small size of, and skew against claims in, the latter's test sets; misclassification of a single claim therefore causes severe penalization of c-F1.

B. Error Analysis
It is apparent from the results that all systems (including LESA) committed some errors in claim detection. Thus, in this section, we explore where our system misclassified the inputs by analysing some examples. Table 9 presents a few instances along with the gold labels and the predictions of the best-performing baseline, CrossDomain (Daxenberger et al., 2017), for comparison. In some cases, both LESA and CrossDomain failed to classify the instances correctly, whereas in others, LESA classifies the instances correctly but CrossDomain could not. We also report intuitions for the misclassifications by LESA in some cases. The presence of numbers and statistics could be the reason behind the misclassifications in examples x1 and x8. Example x3 contains two weak phrases ('alternative medicine as treatment' and 'there is a chance of recovery') which are most likely the cause of misclassification. The former might have been interpreted as a suggestion backed up by some evidence, while in the latter phrase, LESA might have misinterpreted the optimism as a claim. Furthermore, the phrase 'fossil fuels are abundant' in example x4 reflects world knowledge instead of a claim, as interpreted by LESA.

Conclusion

In this paper, we addressed the task of claim detection from online posts. To do this, we proposed a generic and novel deep neural framework, LESA, that leverages a pre-trained language model and two linguistic features, corresponding to the syntactic properties of input texts, for the final classification. Additionally, we tackled texts from distinct sources for the claim detection task in a novel way. In particular, we categorized the input text as noisy, non-noisy, or semi-noisy based on the source, and modeled them separately. Subsequently, we fused them together through an attention module as the combined representation.

One of the major bottlenecks of claim detection in online social media platforms is the lack of qualitative annotation guidelines and a sufficiently large annotated dataset. Therefore, we developed a large Twitter dataset of ∼10,000 manually annotated tweets for claim detection. In addition to our Twitter dataset, we employed six benchmark datasets (representing either semi-noisy or non-noisy input channels) for evaluation of the proposed model. We compared the performance of LESA against four state-of-the-art systems and two pre-trained language models. The comparison showed the superiority of the proposed model, with claim-F1 and macro-F1 improvements over the best performing baselines on average. As a by-product of the study, we released a comprehensive guideline for claim annotation.

Acknowledgement

We would like to thank Rituparna and the LCS2 members for helping in data annotation. The work was partially supported by an Accenture Research Grant, the Ramanujan Fellowship, and CAI, IIIT-Delhi.
References
Firoj Alam, Shaden Shaar, Fahim Dalvi, Hassan Sajjad, Alex Nikolov, Hamdy Mubarak, Giovanni Da San Martino, Ahmed Abdelali, Nadir Durrani, Kareem Darwish, and Preslav Nakov. 2020. Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society. arXiv preprint arXiv:2005.00033.

Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, and Fatima Haouari. 2020. CheckThat! at CLEF 2020: Enabling the automatic identification and verification of claims in social media. In Advances in Information Retrieval, pages 499–507, Cham. Springer International Publishing.

Matthew A. Baum, Katherine Ognyanova, Hanyu Chwe, Alexi Quintana, Roy H. Perlis, David Lazer, James Druckman, Mauricio Santillana, Jennifer Lin, John Della Volpe, Matthew Simonson, and Jon Green. 2020. The state of the nation: A 50-state COVID-19 survey report.

Or Biran and Owen Rambow. 2011. Identifying justifications in written dialogs by classifying text as argumentative. International Journal of Semantic Computing, 5:363–381.

Carlson. 2020. Coronavirus tweets.

Sven Celin. 2020. COVID-19 tweets afternoon 31.03.2020.

Tuhin Chakrabarty, Christopher Hidey, and Kathleen McKeown. 2019. IMHO fine-tuning improves claim detection. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 558–563.

Gullal S. Cheema, Sherzod Hakimov, and Ralph Ewerth. 2020. Check_square at CheckThat! 2020: Claim detection in social media via fusion of transformer and syntactic features. arXiv preprint arXiv:2007.10534.

Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set. JMIR Public Health and Surveillance, 6(2):e19273.

J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37.

Johannes Daxenberger, Steffen Eger, Ivan Habernal, Christian Stab, and Iryna Gurevych. 2017. What is the essence of a claim? Cross-domain claim identification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2055–2066.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Trudy Govier. 2013. A Practical Study of Argument. Cengage Learning.

Shreya Gupta, Parantak Singh, Megha Sundriyal, Md Shad Akhtar, and Tanmoy Chakraborty. 2021. LESA: Linguistic Encapsulation and Semantic Amalgamation based generalised claim detection from online content (supplementary).

Ivan Habernal and Iryna Gurevych. 2015. Exploiting debate portals for semi-supervised argumentation mining in user-generated web discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2127–2137, Lisbon, Portugal. Association for Computational Linguistics.

Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1489–1500.

Ran Levy, Shai Gretz, Benjamin Sznajder, Shay Hummel, Ranit Aharonov, and Noam Slonim. 2017. Unsupervised corpus-wide claim detection. In Proceedings of the 4th Workshop on Argument Mining, pages 79–84.

Marco Lippi and Paolo Torroni. 2015. Context-independent claim detection for argument mining. In Twenty-Fourth International Joint Conference on Artificial Intelligence, pages 185–191.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Alex Nikolov, Giovanni Da San Martino, Ivan Koychev, and Preslav Nakov. 2020. Team_Alex at CLEF CheckThat! 2020: Identifying check-worthy tweets with transformer models. arXiv preprint arXiv:2009.02931.

Andreas Peldszus and Manfred Stede. 2015. Joint prediction in MST-style discourse parsing for argumentation mining. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 938–948, Lisbon, Portugal. Association for Computational Linguistics.

Umair Qazi, Muhammad Imran, and Ferda Ofli. 2020. GeoCoV19: A dataset of hundreds of millions of multilingual COVID-19 tweets with location information. SIGSPATIAL Special, 12(1):6–15.

Sara Rosenthal and Kathleen McKeown. 2012. Detecting opinionated claims in online discussions. In 2012 IEEE Sixth International Conference on Semantic Computing, pages 30–37. IEEE.

Shane Smith. 2020. Coronavirus (COVID-19) tweets - early April.

Christian Stab and Iryna Gurevych. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics, 43(3):619–659.

Stephen E. Toulmin. 2003. The Uses of Argument. Cambridge University Press.

UKPLab. 2017. UKPLab/emnlp2017-claim-identification.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Andrew Whalen and Kevin Laland. 2015. Conformity biased transmission in social networks. Journal of Theoretical Biology, 380:542–549.

WHO. 2020. Immunizing the public against misinformation.

Evan Williams, Paul Rodrigues, and Valerie Novak. 2020. Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models. arXiv preprint arXiv:2009.02431.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems.