[PDF] Contradiction Detection for Rumorous Claims

Abstract

The utilization of social media material in journalistic workflows is increasing, demanding automated methods for the identification of mis- and disinformation. Since textual contradiction across social media posts can be a signal of rumorousness, we seek to model how claims in Twitter posts are being textually contradicted. We identify two different contexts in which contradiction emerges: its broader form can be observed across independently posted tweets and its more specific form in threaded conversations. We define how the two scenarios differ in terms of central elements of argumentation: claims and conversation structure. We design and evaluate models for the two scenarios uniformly as 3-way Recognizing Textual Entailment tasks in order to represent claims and conversation structure implicitly in a generic inference model, while previous studies used explicit or no representation of these properties. To address noisy text, our classifiers use simple similarity features derived from the string and part-of-speech level. Corpus statistics reveal distribution differences for these features in contradictory as opposed to non-contradictory tweet relations, and the classifiers yield state of the art performance.

Full PDF

CContradiction Detection for Rumorous Claims

Piroska Lendvai

Computational LinguisticsSaarland UniversitySaarbr¨ucken, Germany [email protected]

Uwe D. Reichel

Research Institute for LinguisticsHungarian Academy of SciencesBudapest, Hungary [email protected]

Abstract

The utilization of social media material in journalistic workﬂows is increasing, demanding au-tomated methods for the identiﬁcation of mis- and disinformation. Since textual contradictionacross social media posts can be a signal of rumorousness, we seek to model how claims inTwitter posts are being textually contradicted. We identify two different contexts in which con-tradiction emerges: its broader form can be observed across independently posted tweets and itsmore speciﬁc form in threaded conversations. We deﬁne how the two scenarios differ in termsof central elements of argumentation: claims and conversation structure. We design and evaluatemodels for the two scenarios uniformly as 3-way Recognizing Textual Entailment tasks in or-der to represent claims and conversation structure implicitly in a generic inference model, whileprevious studies used explicit or no representation of these properties. To address noisy text,our classiﬁers use simple similarity features derived from the string and part-of-speech level.Corpus statistics reveal distribution differences for these features in contradictory as opposed tonon-contradictory tweet relations, and the classiﬁers yield state of the art performance.

Assigning a veracity judgment to a claim appearing on social media requires complex procedures includ-ing reasoning on claims aggregated from multiple microposts, to establish claim veracity status (resolvedor not) and veracity value (true or false). Until resolution, a claim circulating on social media platformsis regarded as a rumor (Mendoza et al., 2010). The detection of contradicting and disagreeing micropostssupplies important cues to claim veracity processing procedures. These tasks are challenging to autom-atize not only due to the surface noisiness and conciseness of user generated content. One complicatingfactor is that claim denial or rejection is linguistically often not explicitly expressed, but appears withoutclassical rejection markers or modality and speculation cues (Morante and Sporleder, 2012). Explicit andimplicit contradictions furthermore arise in different contexts: in threaded discussions, but also acrossindependently posted messages; both contexts are exempliﬁed in Figure 1 on Twitter data.Language technology has not yet solved the processing of contradiction-powering phenomena, suchas negation (Morante and Blanco, 2012) and stance detection (Mohammad et al., 2016), where stanceis deﬁned to express speaker favorability towards an evaluation target, usually an entity or concept. Inthe veracity computation scenario we can speak of claim targets that are above the entity level:targetsare entire rumors, such as ’11 people died during the Charlie Hebdo attack’. Contradiction and stancedetection have so far only marginally been addressed in the veracity context (de Marneffe et al., 2012;Ferreira and Vlachos, 2016; Lukasik et al., 2016).We propose investigating the advantages of incorporating claim target and conversation context aspremises in the Recognizing Textual Entailment (RTE) framework for contradiction detection in rumor-ous tweets. Our goals are manifold: (a) to offer richer context in contradiction modeling than whatwould be available on the level of individual tweets, the typical unit of analysis in previous studies; (b)to train and test supervised classiﬁers for contradiction detection in the RTE inference framework; (c) toaddress contradiction detection at the level of text similarity only, as opposed to semantic similarity (Xuet al., 2015); (d) to distinguish and focus on two different contradiction relationship types, each involvingspeciﬁc combinations of claim target mention, polarity, and contextual proximity, in particular: a r X i v : . [ c s . C L ] N ov igure 1: Explicit (far left: in threads, left: in independent posts) vs implicit (right: in threads, far right:in independent posts) contradictions in threaded discussions and in independent posts.1. Independent contradictions : Contradictory relation between independent posts, in which twotweets contain different information about the same claim target that cannot simultaneously hold.The two messages are independently posted, i.e., not occurring within a structured conversation.2.

Disagreeing replies : Contradictory relation between a claim-originating tweet and a direct reply toit, whereby the reply expresses disagreement with respect to the claim-introducing tweet.Contradiction between independently posted tweets typically arises in a broad discourse setting, andmay feature larger distance in terms of time, space, and source of information. The claim target is men-tioned in both posts in the contradiction pair, since these posts are uninformed about each other or assumeuninformedness of the reader, and thus do not or can not make coreference to their shared claim target.Due to the same reason, the polarity of both posts with respect to the claim can be identical. Texts pairedin this type of contradiction resemble those of the recent Interpretable Semantic Similarity shared task(Agirre et al., 2016) that calls to identify ﬁve chunk level semantic relation types (equivalence, opposi-tion, speciﬁcity, similarity or relatedness) between two texts that originate from headlines or captions.Disagreeing replies are more speciﬁc instances of contradiction: contextual proximity is small and triv-ially identiﬁable by means of e.g. social media platform metadata, for example the property encoding thetweet ID to which the reply was sent, which in our setup is always a thread-initiating tweet. The claimtarget is by deﬁnition assumed to be contained in the thread-initiating tweet (sometimes termed as claim-or rumor-source tweet). It can be the case that the claim target is not contained in the reply, which canbe explained by the proximity and thus shared context of the two posts. The polarity values in sourceand reply must by deﬁnition be different; we refer to this scenario as Disagreeing replies. Importantly,replies may not contain a (counter-)claim on their own but some other form to express disagreement andpolarity – for example in terms of speculative language use, or the presence of extra-linguistic cues suchas a URL pointing to an online article that holds contradictory content. Such cues are difﬁcult to decodefor a machine, and their representation for training automatic classiﬁers is largely unexplored. Note thatwe do not make assumptions or restrictions about how the claim target is encoded textually in any of thetwo scenarios.In this study, we tackle both contradiction types using a single generic approach: we recast them asthree-way RTE tasks on pairs of tweets. The ﬁndings of our previous study in which semantic inferencesystems with sophisticated, corpus-based or manually created syntactico-semantic features were appliedto contradiction-labeled data indicate the lack of robust syntactic and semantic analysis for short andnoisy texts; cf. Chapter 3 in (Lendvai et al., 2016b). This motivates our current simple text similaritymetrics in search of alternative methods for the contradiction processing task.In Section 2 we introduce related work and resources, in Sections 3 and 4 present and motivate thecollections and the features used for modeling. After the description of method and scores in Section 5,ﬁndings are discussed in Section 6.

Related work and resources

Recognizing Textual Entailment (RTE)

Processing semantic inference phenomena such as contra-diction, entailment and stance between text pairs has been gaining momentum in language technology.Inference has been suggested to be conveniently formalized in the generic framework of RTE (Dagan etal., 2006). As an improvement over the binary Entailment vs Non-entailment scenario, three-way RTEhas appeared but is still scarcely investigated (Ferreira and Vlachos, 2016; Lendvai et al., 2016a). The Entailment relation between two text snippets holds if the claim present in snippet B can be concludedfrom snippet A. The

Contradiction relation applies when the claim in A and the claim in B cannot besimultaneously true. The

Unknown relation applies if A and B neither entail nor contradict each other.The RTE-3 benchmark dataset is the ﬁrst resource that labels paired text snippets in terms of 3-wayRTE judgments (De Marneffe et al., 2008), but it is comprised of general newswire texts. Similarly,the new large annotated corpus used for deep models for entailment (Bowman et al., 2015) labeled textpairs as Contradiction are too broadly deﬁned, i.e., expressing generic semantic incoherence rather thansemantically motivated polarization and mismatch that we are after, which questions its utility in therumor veriﬁcation context.As far as contradiction processing is concerned, accounting for negation in RTE is the focus of arecent study (Madhumita, 2016), but it is still set according to the binary RTE setup. A standalonecontradiction detection system was implemented by (De Marneffe et al., 2008), using complex rule-based features. A speciﬁc RTE application, the Excitement Open Platform (Pad´o et al., 2015) hasbeen developed to provide a generic platform for applied RTE. It integrates several entailment decisionalgorithms, while only the Maximum Entropy-based model (Wang and Neumann, 2007) is available for3-way RTE classiﬁcation. This model implements state-of-the-art linguistic preprocessing augmentedwith lexical resources (WordNet, VerbOcean), and uses the output of part-of-speech and dependencyparsing in its structure-oriented, overlap-based approach for classiﬁcation and was tested for both ourtasks as explained in (Lendvai et al., 2016b). Stance detection

Stance classiﬁcation and stance-labeled corpora are relevant for contradiction detec-tion, because the relationship of two texts expressing opposite stance (positive and negative) can in somecontexts be judged to be contradictory: this is exactly what our Disagreeing reply scenario covers. Stanceclassiﬁcation for rumors was introduced by (Qazvinian et al., 2011) where the goal was to generate a bi-nary (for or against) stance judgment. Stance is typically classiﬁed on the level of individual tweets:reported approaches predominantly utilize statistical models, involving supervised machine learning (deMarneffe et al., 2012) and RTE (Ferreira and Vlachos, 2016). Another relevant aspect of stance detectionfor our current study is the presence of the stance target in the text to be stance-labeled. A recent sharedtask on social media data deﬁned separate challenges depending on whether target-speciﬁc training datais included in the task or not (Mohammad et al., 2016); the latter requires additional effort to encodeinformation about the stance target, cf. e.g. (Augenstein et al., 2016). The PHEME project released anew stance-labeled social media dataset (Zubiaga et al., 2015) that we also utilize as described next.

The two datasets corresponding to our two tasks are drawn from a freely available, annotated socialmedia corpus that was collected from the Twitter platform via ﬁltering on event-related keywords andhashtags in the Twitter Streaming API. We worked with English tweets related to four events: the Ottawashooting , the Sydney Siege , the Germanwings crash , and the Charlie Hebdo shooting . Each event in http://hltfbk.github.io/Excitement-Open-Platform https://ﬁgshare.com/articles/PHEME rumour scheme dataset journalism use case/2068650 twitter.com https://en.wikipedia.org/wiki/2014 shootings at Parliament Hill, Ottawa https://en.wikipedia.org/wiki/2014 Sydney hostage crisis https://en.wikipedia.org/wiki/Germanwings Flight 9525 https://en.wikipedia.org/wiki/Charlie Hebdo shooting vent ENT CON UNK chebdo 143 34 486 36 736 647 427 866 27 199gwings 39 6 107 13 176 461 257 447 4 29ottawa 79 37 292 28 465 555 377 168 18 125ssiege 112 59 456 37 697 332 317 565 21 143373 136 1341 114 2074 1995 1378 2046 70 496 Table 1:

Threads (left) and iPosts (right) RTE datasets compiled from 4 crisis events: amount of pairs perentailment type (

ENT, CON, UNK ), amount of unique rumorous claims ( ) used for creatingthe pairs, amount of unique tweets discussing these claims ( ).the corpus was pre-annotated as explained in (Zubiaga et al., 2015) for several rumorous claims – ofﬁ-cially not yet conﬁrmed statements lexicalized by a concise proposition, e.g. ”Four cartoonists were killedin the Charlie Hebdo attack” and ”French media outlets to be placed under police protection” . The corpus collec-tion method was based on a retweet threshold, therefore most tweets originate from authoritative sourcesusing relatively well-formed language, whereas replying tweets often feature non-standard language use.Tweets are organized into threaded conversations in the corpus and are marked up with respect tostance, certainty, evidentiality, and other veracity-related properties; for full details on released data werefer to (Zubiaga et al., 2015). The dataset on which we run disagreeing reply detection (henceforth: Threads ) was converted by us to RTE format based on the threaded conversations labeled in this corpus.We created the Threads RTE dataset drawing on manually pre-assigned Response Type labels by (Zu-biaga et al., 2015) that were meant to characterize source tweet – replying tweet relations in terms offour categories. We mapped these four categories onto three RTE labels: a reply pre-labeled as

Agreed with respect to its source tweet was mapped to

Entailment , a reply pre-labeled as

Disagreed was mappedto

Contradiction , while replies pre-labeled as

AppealforMoreInfo and

Comment were mapped to

Un-known . Only direct replies to source tweets relating to the same four events as in the independent postsRTE dataset were kept. There are 1,850 tweet pairs in this set; the proportion of contradiction instancesamounts to 7%. The

Threads dataset holds

CON , ENT and

UNK pairs as exempliﬁed below. Conformthe RTE format, pair elements are termed text and hypothesis – note that directionality between t and h is assumed as symmetric in our current context so t and h are assigned based on token-level length. • CON < t > We understand there are two gunmen and up to a dozen hostages inside the cafe under siege atSydney.. ISIS ﬂags remain on display 7News < /t > < h > not ISIS ﬂags < /h > • ENT < t > Report: Co-Pilot Locked Out Of Cockpit Before Fatal Plane Crash URL Germanwings URL < /t >< h > This sounds like pilot suicide. < /h > • UNK < t > BREAKING NEWS: At least 3 shots ﬁred at Ottawa War Memorial. One soldier conﬁrmed shot -URL URL < /t > < h > All our domestic military should be armed, now. < /h > .The independently posted tweets dataset (henceforth: iPosts ) that we used for contradiction detectionbetween independently emerging claim-initiating tweets is described in (Lendvai et al., 2016a). Thiscollection is holds 5.4k RTE pairs generated from about 500 English tweets using semi-automatic 3-wayRTE labeling, based on semantic or numeric mismatches between the rumorous claims annotated in thedata. The proportion of contradictory pairs ( CON ) amounts to 25%. The two collections are quantiﬁedin Table 1. iPosts dataset examples are given below. • CON < t >

12 people now known to have died after gunmen stormed the Paris HQ of magazine CharlieHebdoURL URL < /t > < h > Awful. 11 shot dead in an assault on a Paris magazine. URL CharlieHebdo URL < /h > • ENT < t > SYDNEY ATTACK - Hostages at Sydney cafe - Up to 20 hostages - Up to 2 gunmen - Hostagesseen holding ISIS ﬂag DEVELOPING.. < /t > < h > Up to 20 held hostage in Sydney Lindt Cafe siege URLURL < /h > • UNK < t > BREAKING: NSW police have conﬁrmed the siege in Sydney’s CBD is now over, a police ofﬁceris reportedly among the several injured. < /t > < h > Update: Airspace over Sydney has been shut down. Livecoverage: URL sydneysiege < /h > . Rumor , rumorous claim and claim are used interchangeably throughout the paper to refer to the same concept. Text similarity features

Data preprocessing on both datasets included screen name and hashtag sign removal and URL mask-ing. Then, for each tweet pair we extracted vocabulary overlap and local text alignment features. Thetweets were part-of-speech-tagged using the Balloon toolkit (Reichel, 2012) (PENN tagset, (Marcus etal., 1999)), normalized to lowercase and stemmed using an adapted version of the Porter stemmer (Porter,1980). Content words were deﬁned to belong to the set of nouns, verbs, adjectives, adverbs, and numbers,and were identiﬁed by their part of speech labels. All punctuation was removed.

Vocabulary overlap was calculated for content word stem types in terms of the Cosine similarity and theF1 score. The Cosine similarity of two tweets is deﬁned as C ( X, Y ) = | X ∩ Y | √ | X |·| Y | , where X and Y denotethe sets of content word stems in the tweet pair.The F1 score is deﬁned as the harmonic mean of precision and recall. Precision and recall here referto covering the vocabulary X of one tweet by the vocabulary Y of another tweet (or vice versa). It isgiven by F · | X ∩ Y || X | · | X ∩ Y || Y || X ∩ Y || X | + | X ∩ Y || Y | . Again the vocabularies X and Y consist of stemmed content words.Just like the Cosine index, the F1 score is a symmetric similarity metric.These two metrics are additionally applied to the content word POS label inventories within the tweetpair, which gives the four features cosine, cosine pos, f score , and f score pos , respectively. The amount of stemmed word token overlap was measured by applying local alignment of the tokensequences using the Smith-Waterman algorithm (Smith and Waterman, 1981). We chose a score functionrewarding zero substitutions by +1 , and punishing insertions, deletions, and substitutions each by -reset.Having ﬁlled in the score matrix H , alignment was iteratively applied the following way: while max( H ) ≥ t – trace back from the cell containing this maximum the path leading to it until a zero-cell is reached– add the substring collected on this way to the set of aligned substrings– set all traversed cells to 0. The threshold t deﬁnes the required minimum length of aligned substrings. It is set to 1 in this study,thus it supports a complete alignment of any pair of permutations of x . The traversed cells are set to after each iteration step to prevent that one substring would be related to more than one alignment pair.This approach would allow for two restrictions: to prevent cross alignment not just the traversed cells [ i, j ] but for each of these cells its entire row i and column j needs to be set to 0. Second, if only thelongest common substring is of interest, then the iteration is trivially to be stopped after the ﬁrst step.Since we did not make use of these restrictions, in our case the alignment supports cross-dependenciesand can be regarded as an iterative application of a longest common substring match.From the substring pairs in tweets x and y aligned this way, we extracted two text similarity measures: • laProp : the proportion of locally aligned tokens over both tweets m ( x )+ m ( y ) n ( x )+ n ( y ) • laPropS : the proportion of aligned tokens in the shorter tweet m (ˆ z ) n (ˆ z ) , ˆ z = arg min z ∈{ x,y } [ n ( z )] ,where n ( z ) denotes the number of all tokens and m ( z ) the number of aligned tokens in tweet z . Figures 2 and 3 show the distribution of the features introduced above each for a selected event in bothdatasets. Each ﬁgure half represents a dataset; each subplot shows the distribution of a feature in depen-dence of the three RTE classes for the selected event in that dataset.The plots indicate a general trend over all events and datasets: the similarity features reach highestvalues for the ENT class, followed by CON and UNK. Kruskal-Wallis tests applied separately for allcombinations of features, events and datasets conﬁrmed these trends, revealing signiﬁcant differences forall boxplot triplets ( p < . after correction for type 1 errors in this high amount of comparisons usinghe false discovery rate method of (Benjamini and Yekutieli, 2001)). Dunnett post hoc tests howeverclariﬁed that for 16 out of 72 comparisons (all POS similarity measures) only UNK but not ENT andCON differ signiﬁcantly ( α = 0 . ). Both datasets contain the same amount of non-signiﬁcant cases.Nevertheless, these trends are encouraging to test whether an RTE task can be addressed by string andPOS-level similarity features alone, without syntactic or semantic level tweet comparison.Figure 2: Distributions of the similarity metrics by tweet pair class for the event chebdo in the Threads ( left ) and the iPosts dataset ( right). Figure 3: Distributions of the similarity metrics by tweet pair class for the event ssiege in the

Threads ( left ) and the iPosts dataset ( right). RTE classiﬁcation experiments for Contradiction and Disagreeing Reply detection

In order to predict the RTE classes based on the features introduced above, we trained two classiﬁers:Nearest (shrunken) centroids (NC) (Tibshirani et al., 2003) and Random forest (RF) (Breiman, 2001;Liaw and Wiener, 2002), using the R wrapper package

Caret (Kuhn, 2016) with the methods pam and rf , respectively. To derive the same number of instances for all classes, we applied separately for bothdatasets resampling without replacement, so that the total data amounts about 4,550 feature vectorsequally distributed over the three classes, the majority of 4,130 belonging to the iPosts data set. Fur-ther, we centered and scaled the feature matrix. Within the Caret framework we optimized the tunableparameters of both classiﬁers by maximizing the F1 score. This way the NC shrinkage delta was setto 0, which means that the class reference centroids are not modiﬁed. For RF the number of variablesrandomly sampled as candidates at each split was set to 2. The remaining parameters were kept default.The classiﬁers were tested on both datasets in a 4-fold event-based held-out setting, training on threeevents and testing on the remaining one (4-fold cross-validation, CV), quantifying how performance gen-eralizes to new events with unseen claims and unseen targets. The CV scores are summarized in Tables 2and 3. It turns out generally that classifying CON is more difﬁcult than classifying ENT or UNK. We ob-serve a dependency of the classiﬁer performances on the two contradiction scenarios: for detecting CON,RF achieved higher classiﬁcation values on Threads, whereas NC performed better on iPosts. Generalperformance across all three classes was better in independent posts than in conversational threads.Deﬁnitions of contradiction, the genre of texts and the features used are dependent on end applications,making performance comparison nontrivial (Lendvai et al., 2016b). On a different subset of the Threadsdata in terms of events, size of evidence, 4 stance classes and no resampling, (Lukasik et al., 2016) report.40 overall F-score using Gaussian processes, cosine similarity on text vector representation and tempo-ral metadata. Our previous experiments were done using the Excitement Open Platform incorporatingsyntactico-semantic processing and 4-fold CV. For the non-resampled Threads data we reported .11 F1on CON via training on iPosts (Lendvai et al., 2016b). On the non-resampled iPosts data we obtained.51 overall F1 score (Lendvai et al., 2016a), F1 on CON being .25 (Lendvai et al., 2016b).CON ENT UNKF1 (RF/ NC ) 0.33/ wgt prec. 0.51/0.55wgt rec. 0.47/0.51Table 2: iPosts dataset. Mean and weighted (wgt) mean results on held-out data after event held-outcross validation for the Random Forest (RF) and Nearest Centroid (NC) classiﬁers.CON ENT UNKF1 ( RF /NC) /0.11 0.45/0.50 0.40/0.36precision 0.42/0.07 0.52/0.56 0.34/0.31recall 0.35/0.20 0.41/0.47 0.50/0.61accuracy 0.42/0.39wgt F1 /0.32wgt prec. 0.47/0.33wgt rec. 0.42/0.39Table 3: Threads dataset. Mean and weighted (wgt) mean results on held-out data after event held-outcross validation for the Random forest and Nearest Centroid classiﬁers (RF/NC).e proposed to model two types of contradictions: in the ﬁrst both tweets encode the claim target(iPosts), in the second typically only one of them (Threads). The Nearest Centroid algorithm performspoorly on the CON class in Threads where textual overlap is typically small especially for the CON andUNK classes, in part due to the absence of the claim target in replies. However, the Random Forestalgorithm’s performance is not affected by this factor. The advantage of RF on the Threads data canbe explained by its property of training several weak classiﬁers on parts of the feature vectors only.By this boosting strategy a usually undesirable combination of relatively long feature vectors but fewtraining observations can be tackled, holding for the Threads data that due to its extreme skewedness (cf.Table 1) shrunk down to only 420 datapoints after our class balancing technique of resampling withoutreplacement. Results indicate the beneﬁt of RF classiﬁers in such sparse data cases.The good performance of NC on the much larger amount of data in iPosts is in line with the corpusstatistics reported in section 4.3, implying a reasonably small amount of class overlap. The classes arethus relatively well represented by their centroids, which is exploited by the NC classiﬁer. However, asillustrated in Figures 2 and 3, the majority of feature distributions are generally better separated for ENTand UNK, while CON in its mid position shows more overlap to both other classes and is thus overall aless distinct category.

The detection of contradiction and disagreement in microposts supplies important cues to factuality andveracity assessment, and is a central task in computational journalism. We developed classiﬁers in auniform, general inference framework that differentiates two tasks based on contextual proximity of thetwo posts to be assessed, and if the claim target may or may not be omitted in their content. We utilizedsimple text similarity metrics that proved to be a good basis for contradiction classiﬁcation.Text similarity was measured in terms of vocabulary and token sequence overlap. To derive the latter,local alignment turned out to be a valuable tool: as opposed to standard global alignment (Wagnerand Fischer, 1974), it can account for crossing dependencies and thus for varying sequential order ofinformation structure in entailing text pairs, e.g. in ”the cat chased the mouse” and ”the mouse waschased by the cat”, which are differently structured into topic and comment (Halliday, 1967). We expectcontradictory content to exhibit similar trends in variation with respect to content unit order – especiallyin the Threads scenario, where entailment inferred from a reply can become the topic of a subsequentreplying tweet. Since local alignment can resolve such word order differences, it is able to preserve textsimilarity of entailing tweet pairs, which is reﬂected in the relative laProp boxplot heights in Figures 2and 3.We have run leave-one-event-out evaluation separately on the independent posts data and on the con-versational threads data, which allowed us to compare performances on collections originating from thesame genre and platform, but on content where claim targets in the test data are different from the targetsin the training data. Our obtained generalization performance over unseen events turns out to be in linewith previous reports. Via downsampling, we achieved a balanced performance on both tasks across thethree RTE classes; however, in line with previous work, even in this setup the overall performance oncontradiction is the lowest, whereas detecting the lack of contradiction can be achieved with much betterperformance in both contradiction scenarios.Possible extensions to our approach include incorporating more informed text similarity metrics (B¨aret al., 2012), formatting phenomena (Tolosi et al., 2016), and distributed contextual representations (Leand Mikolov, 2014), the utilization of knowledge-intensive resources (Pad´o et al., 2015), representationof alignment on various content levels (Noh et al., 2015), and formalization of contradiction scenarios interms of additional layers of perspective (van Son et al., 2016).

P. Lendvai was supported by the PHEME FP7 project (grant nr. 611233), U. D. Reichel by an Alexandervon Humboldt Society grant. We thank anonymous reviewers for their input. eferences [Agirre et al.2016] Eneko Agirre, Aitor Gonzalez-Agirre, Inigo Lopez-Gazpio, Montse Maritxalar, German Rigau,and Larraitz Uria. 2016. Semeval-2016 task 2: Interpretable semantic textual similarity.

Proceedings ofSemEval , pages 512–524.[Augenstein et al.2016] Isabelle Augenstein, Andreas Vlachos, and Kalina Bontcheva. 2016. USFD: Any-TargetStance Detection on Twitter with Autoencoders. In Saif M. Mohammad, Svetlana Kiritchenko, Parinaz Sobhani,Xiaodan Zhu, and Colin Cherry, editors,

Proceedings of the International Workshop on Semantic Evaluation ,SemEval ’16, San Diego, California.[B¨ar et al.2012] Daniel B¨ar, Chris Biemann, Iryna Gurevych, and Torsten Zesch. 2012. UKP: Computing semantictextual similarity by combining multiple content similarity measures. In

Proceedings of the First Joint Confer-ence on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the sharedtask, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation , pages 435–440.Association for Computational Linguistics.[Benjamini and Yekutieli2001] Yoav Benjamini and Daniel Yekutieli. 2001. The control of the false discovery ratein multiple testing under dependency.

Annals of Statistics , 29:1165–1188.[Bowman et al.2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015.A large annotated corpus for learning natural language inference. In

Proceedings of the 2015 Conference onEmpirical Methods in Natural Language Processing , pages 632–642, Lisbon, Portugal, September. Associationfor Computational Linguistics.[Breiman2001] Leo Breiman. 2001. Random forests.

Machine Learning , 45(1):5–32.[Dagan et al.2006] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textualentailment challenge. In

Machine learning challenges. evaluating predictive uncertainty, visual object classiﬁ-cation, and recognising tectual entailment , pages 177–190. Springer.[De Marneffe et al.2008] Marie-Catherine De Marneffe, Anna N Rafferty, and Christopher D Manning. 2008.Finding contradictions in text. In

Proc. of ACL , volume 8, pages 1039–1047.[de Marneffe et al.2012] Marie-Catherine de Marneffe, Christopher D Manning, and Christopher Potts. 2012. Didit happen? the pragmatic complexity of veridicality assessment.

Computational Linguistics , 38(2):301–333.[Ferreira and Vlachos2016] William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stanceclassiﬁcation. In

Proceedings of NAACL .[Halliday1967] Michael Alexander Kirkwood Halliday. 1967. Notes on transitivity and theme in English, part II.

Journal of Linguistics , 3(2):199–244.[Kuhn2016] Max Kuhn, 2016. caret: Classiﬁcation and Regression Training . R package version 6.0-71.[Le and Mikolov2014] Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and docu-ments. In

ICML , volume 14, pages 1188–1196.[Lendvai et al.2016a] Piroska Lendvai, Isabelle Augenstein, Kalina Bontcheva, and Thierry Declerck. 2016a.Monolingual social media datasets for detecting contradiction and entailment. In

Proc. of LREC-2016 .[Lendvai et al.2016b] Piroska Lendvai, Isabelle Augenstein, Dominic Rout, Kalina Bontcheva, and Thierry De-clerck. 2016b. Algorithms for Detecting Disputed Information. Deliverable D4.2.2 for FP7-ICT Collabo-rative Project ICT-2013-611233 PHEME. .[Liaw and Wiener2002] Andy Liaw and Matthew Wiener. 2002. Classiﬁcation and regression by randomForest.

RNews , 2(3):18–22.[Lukasik et al.2016] Michal Lukasik, P.K. Srijith, Duy Vu, Kalina Bontcheva, Arkaitz Zubiaga, and Trevor Cohn.2016. Hawkes Processes for Continuous Time Sequence Classiﬁcation: An Application to Rumour StanceClassiﬁcation in Twitter. In

Proceedings of ACL-16 .[Madhumita2016] Madhumita. 2016. Recognizing textual entailment. Master’s thesis, Saarland University,Saarbr¨ucken, Germany.Marcus et al.1999] Mitchell P. Marcus, Ann Taylor, Robert MacIntyre, Ann Bies, Constance Cooper, MarkFerguson, and Alison Littman. 1999. The Penn Treebank Project. . visited on Sep 29th 2016.[Mendoza et al.2010] Marcelo Mendoza, Barbara Poblete, and Carlos Castillo. 2010. Twitter Under Crisis: CanWe Trust What We RT? In

Proceedings of the First Workshop on Social Media Analytics (SOMA’2010) , pages71–79, New York, NY, USA. ACM.[Mohammad et al.2016] Saif M. Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and ColinCherry. 2016. SemEval-2016 Task 6: Detecting stance in tweets. In

Proceedings of the International Workshopon Semantic Evaluation , SemEval ’16, San Diego, California.[Morante and Blanco2012] Roser Morante and Eduardo Blanco. 2012. *SEM 2012 shared task: Resolving thescope and focus of negation. In

Proceedings of the First Joint Conference on Lexical and ComputationalSemantics .[Morante and Sporleder2012] Roser Morante and Caroline Sporleder, editors. 2012.

ExProm ’12: Proceedings ofthe ACL-2012 Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics . Associationfor Computational Linguistics.[Noh et al.2015] Tae-Gil Noh, Sebastian Pad´o, Vered Shwartz, Ido Dagan, Vivi Nastase, Kathrin Eichler, LiliKotlerman, and Meni Adler. 2015. Multi-level alignments as an extensible representation basis for textualentailment algorithms.

Lexical and Computational Semantics (* SEM 2015) , page 193.[Pad´o et al.2015] Sebastian Pad´o, Tae-Gil Noh, Asher Stern, Rui Wang, and Roberto Zanoli. 2015. Design andRealization of a Modular Architecture for Textual Entailment.

Natural Language Engineering , 21(02):167–200.[Porter1980] Martin F. Porter. 1980. An algorithm for sufﬁx stripping.

Program , 14(3):130–137.[Qazvinian et al.2011] Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumorhas it: Identifying misinformation in microblogs. In

Proceedings of the Conference on Empirical Methods inNatural Language Processing , EMNLP ’11, pages 1589–1599.[Reichel2012] Uwe D. Reichel. 2012. PermA and Balloon: Tools for string alignment and text processing. In

Proc. Interspeech , page paper no. 346, Portland, Oregon, USA.[Smith and Waterman1981] Temple F. Smith and Michael S. Waterman. 1981. Identiﬁcation of common molecularsubsequences.

Journal of Molecular Biology , 147:195–197.[Tibshirani et al.2003] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. 2003.Class prediction by nearest shrunken centroids,with applications to DNA microarrays.

Statistical Science ,18(1):104–117.[Tolosi et al.2016] Laura Tolosi, Andrey Tagarev, and Georgi Georgiev. 2016. An analysis of event-agnostic fea-tures for rumour classiﬁcation in twitter. In

Proc. of Social Media in the Newsroom Workshop .[van Son et al.2016] Chantal van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo,and Piek Vossen. 2016. GRaSP: A Multilayered Annotation Scheme for Perspectives. In

Proceedings of the10th Edition of the Language Resources and Evaluation Conference (LREC) .[Wagner and Fischer1974] Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction prob-lem.

Journal of the Association for Computing Machinery , 21(1):168–173.[Wang and Neumann2007] Rui Wang and G¨unter Neumann. 2007. Recognizing textual entailment using a subse-quence kernel method. In

AAAI , volume 7, pages 937–945.[Xu et al.2015] Wei Xu, Chris Callison-Burch, and William B Dolan. 2015. SemEval-2015 Task 1: Paraphraseand semantic similarity in Twitter (PIT).

Proceedings of SemEval .[Zubiaga et al.2015] Arkaitz Zubiaga, Maria Liakata, Rob Procter, Kalina Bontcheva, and Peter Tolmie. 2015.Towards Detecting Rumours in Social Media.