Trouble on the Horizon: Forecasting the Derailment of Online Conversations as they Develop
Jonathan P. Chang
Cornell University [email protected]
Cristian Danescu-Niculescu-Mizil
Cornell University [email protected]
Abstract
Online discussions often derail into toxic exchanges between participants. Recent efforts mostly focused on detecting antisocial behavior after the fact, by analyzing single comments in isolation. To provide more timely notice to human moderators, a system needs to preemptively detect that a conversation is heading towards derailment before it actually turns toxic. This means modeling derailment as an emerging property of a conversation rather than as an isolated utterance-level event. Forecasting emerging conversational properties, however, poses several inherent modeling challenges. First, since conversations are dynamic, a forecasting model needs to capture the flow of the discussion, rather than properties of individual comments. Second, real conversations have an unknown horizon: they can end or derail at any time; thus a practical forecasting model needs to assess the risk in an online fashion, as the conversation develops. In this work we introduce a conversational forecasting model that learns an unsupervised representation of conversational dynamics and exploits it to predict future derailment as the conversation develops. By applying this model to two new diverse datasets of online conversations with labels for antisocial events, we show that it outperforms state-of-the-art systems at forecasting derailment.

1 Introduction

"Ché saetta previsa vien più lenta."
("The arrow one foresees arrives more gently.")
– Dante Alighieri, Divina Commedia, Paradiso

Antisocial behavior is a persistent problem plaguing online conversation platforms; it is both widespread (Duggan, 2014) and potentially damaging to mental and emotional health (Raskauskas and Stoltz, 2007; Akbulut et al., 2010). The strain this phenomenon puts on community maintainers

(1) [User A] What does [quote omitted] refer to? I assume it should be written from June 2010 to December 2011 and we should precise [sic] the months.
(2) [User B]
No. It refers to 2007-2011 Belgian political crisis
(3) [User A] [User B]
Yes it's not ridiculous at all to claim [it's original research] because it doesn't fit your argument. A crisis can be composed out of several smaller crisis. It's not original research if some of your sources only talk about parts [...]
(5) [User A]
Where is the source that claim the crisis is 4 year long? Sources state claim it is 18 month long and refer to the period from June 2010 to December 2011.
(6) [User B]
There were 4 governments and 2 years of no government in 4 years time. You can not sanely claim that this Must be viewed as two seperate crisis. What exactly splits them up? [...]
Figure 1: Example start of a conversation that will eventually derail into a personal attack.

has sparked recent interest in computational approaches for assisting human moderators. Prior work in this direction has largely focused on post-hoc identification of various kinds of antisocial behavior, including hate speech (Warner and Hirschberg, 2012; Davidson et al., 2017), harassment (Yin et al., 2009), personal attacks (Wulczyn et al., 2017), and general toxicity (Pavlopoulos et al., 2017). The fact that these approaches only identify antisocial content after the fact limits their practicality as tools for assisting pre-emptive moderation in conversational domains.

Addressing this limitation requires forecasting the future derailment of a conversation based on early warning signs, giving the moderators time to potentially intervene before any harm is done (Liu et al., 2018; Zhang et al., 2018a; see Jurgens et al., 2019 for a discussion). Such a goal recognizes derailment as emerging from the development of the conversation, and belongs to the broader area of conversational forecasting, which includes future-prediction tasks such as predicting the eventual length of a conversation (Backstrom et al., 2013), whether a persuasion attempt will eventually succeed (Tan et al., 2016; Wachsmuth et al., 2018; Yang et al., 2019), whether team discussions will eventually lead to an increase in performance (Niculae and Danescu-Niculescu-Mizil, 2016), or whether ongoing counseling conversations will eventually be perceived as helpful (Althoff et al., 2016).

Approaching such conversational forecasting problems, however, requires overcoming several inherent modeling challenges. First, conversations are dynamic and their outcome might depend on how subsequent comments interact with each other.
Consider the example in Figure 1: while no individual comment is outright offensive, a human reader can sense a tension emerging from their succession (e.g., dismissive answers to repeated questioning). Thus a forecasting model needs to capture not only the content of each individual comment, but also the relations between comments. Previous work has largely relied on hand-crafted features to capture such relations—e.g., similarity between comments (Althoff et al., 2016; Tan et al., 2016) or conversation structure (Zhang et al., 2018b; Hessel and Lee, 2019)—though neural attention architectures have also recently shown promise (Jo et al., 2018).

The second modeling challenge stems from the fact that conversations have an unknown horizon: they can be of varying lengths, and the to-be-forecasted event can occur at any time. So when is it a good time to make a forecast? Prior work has largely proposed two solutions, both resulting in important practical limitations. One solution is to assume (unrealistic) prior knowledge of when the to-be-forecasted event takes place and extract features up to that point (Niculae et al., 2015; Liu et al., 2018). Another compromising solution is to extract features from a fixed-length window, often at the start of the conversation (Curhan and Pentland, 2007; Niculae and Danescu-Niculescu-Mizil, 2016; Althoff et al., 2016; Zhang et al., 2018a). Such an approach ignores information from later in the conversation and precludes early detection.

Footnote: We can distinguish two types of forecasting tasks, depending on whether the to-be-forecasted target is an event that might take place within the conversation (e.g., derailment) or an outcome measured after the conversation will eventually conclude (e.g., helpfulness). The following discussion of modeling challenges holds for both.

In this work we introduce a model for forecasting conversational events that overcomes both these inherent challenges by processing comments, and their relations, as they happen (i.e., in an online fashion).
Our main insight is that models with these properties already exist, albeit geared toward generation rather than prediction: recent work in context-aware dialog generation (or "chatbots") has proposed sequential neural models that make effective use of the intra-conversational dynamics (Sordoni et al., 2015b; Serban et al., 2016, 2017), while concomitantly being able to process the conversation as it develops (see Gao et al. (2018) for a survey).

In order for these systems to perform well in the generative domain they need to be trained on massive amounts of (unlabeled) conversational data. The main difficulty in directly adapting these models to the supervised domain of conversational forecasting is the relative scarcity of labeled data: for most forecasting tasks, at most a few thousand labeled examples are available, insufficient for the notoriously data-hungry sequential neural models. To overcome this difficulty, we propose to decouple the objective of learning a neural representation of conversational dynamics from the objective of predicting future events. The former can be pre-trained on large amounts of unsupervised data, similarly to how chatbots are trained. The latter can piggy-back on the resulting representation after fine-tuning it for classification using relatively small labeled data. While similar pre-train-then-fine-tune approaches have recently achieved state-of-the-art performance in a number of NLP tasks—including natural language inference, question answering, and commonsense reasoning (discussed in Section 2)—to the best of our knowledge this is the first attempt at applying this paradigm to conversational forecasting.

To test the effectiveness of this new architecture in forecasting derailment of online conversations, we develop and distribute two new datasets.
The first triples the size of the highly curated 'Conversations Gone Awry' dataset (Zhang et al., 2018a), where civil-starting Wikipedia Talk Page conversations are crowd-labeled according to whether they eventually lead to personal attacks; the second relies on in-the-wild moderation of the popular subreddit ChangeMyView, where the aim is to forecast whether a discussion will later be subject to moderator action due to "rude or hostile" behavior. In both datasets, our model outperforms existing fixed-window approaches, as well as simpler sequential baselines that cannot account for inter-comment relations. Furthermore, by virtue of its online processing of the conversation, our system can provide substantial prior notice of upcoming derailment, triggering on average 3 comments (or 3 hours) before an overtly toxic comment is posted.

To summarize, in this work we:
• introduce the first model for forecasting conversational events that can capture the dynamics of a conversation as it develops;
• build two diverse datasets (one entirely new, one extending prior work) for the task of forecasting derailment of online conversations;
• compare the performance of our model against the current state-of-the-art, and evaluate its ability to provide early warning signs.

Our work is motivated by the goal of assisting human moderators of online communities by preemptively signaling at-risk conversations that might deserve their attention. However, we caution that any automated systems might encode or even amplify the biases existing in the training data (Park et al., 2018; Sap et al., 2019; Wiegand et al., 2019), so a public-facing implementation would need to be exhaustively scrutinized for such biases (Feldman et al., 2015).

2 Further Related Work

Antisocial behavior.
Antisocial behavior online comes in many forms, including harassment (Vitak et al., 2017), cyberbullying (Singh et al., 2017), and general aggression (Kayany, 1998). Prior work has sought to understand different aspects of such behavior, including its effect on the communities where it happens (Collier and Bear, 2012; Arazy et al., 2013), the actors involved (Cheng et al., 2017; Volkova and Bell, 2017; Kumar et al., 2018; Ribeiro et al., 2018) and connections to the outside world (Olteanu et al., 2018).
Post-hoc classification of conversations.
There is a rich body of prior work on classifying the outcome of a conversation after it has concluded, or classifying conversational events after they happened. Many examples exist, but some more closely related to our present work include identifying the winner of a debate (Zhang et al., 2016; Potash and Rumshisky, 2017; Wang et al., 2017), identifying successful negotiations (Curhan and Pentland, 2007; Cadilhac et al., 2013), as well as detecting whether deception (Girlea et al., 2016; Pérez-Rosas et al., 2016; Levitan et al., 2018) or disagreement (Galley et al., 2004; Abbott et al., 2011; Allen et al., 2014; Wang and Cardie, 2014; Rosenthal and McKeown, 2015) has occurred.

Our goal is different because we wish to forecast conversational events before they happen and while the conversation is still ongoing (potentially allowing for interventions). Note that some post-hoc tasks can also be re-framed as forecasting tasks (assuming the existence of necessary labels); for instance, predicting whether an ongoing conversation will eventually spark disagreement (Hessel and Lee, 2019), rather than detecting already-existing disagreement.
Conversational forecasting.
As described in Section 1, prior work on forecasting conversational outcomes and events has largely relied on hand-crafted features to capture aspects of conversational dynamics. Example feature sets include statistical measures based on similarity between utterances (Althoff et al., 2016), sentiment imbalance (Niculae et al., 2015), flow of ideas (Niculae et al., 2015), increase in hostility (Liu et al., 2018), reply rate (Backstrom et al., 2013) and graph representations of conversations (Garimella et al., 2017; Zhang et al., 2018b). By contrast, we aim to automatically learn neural representations of conversational dynamics through pre-training.

Such hand-crafted features are typically extracted from fixed-length windows of the conversation, leaving unaddressed the problem of unknown horizon. While some work has trained multiple models for different window-lengths (Liu et al., 2018; Hessel and Lee, 2019), they consider these models to be independent and, as such, do not address the issue of aggregating them into a single forecast (i.e., deciding at what point to make a prediction). We implement a simple sliding windows solution as a baseline (Section 5).
Pre-training for NLP.
The use of pre-training for natural language tasks has been growing in popularity after recent breakthroughs demonstrating improved performance on a wide array of benchmark tasks (Peters et al., 2018; Radford et al., 2018). Existing work has generally used a language modeling objective as the pre-training objective; examples include next-word prediction (Howard and Ruder, 2018), sentence autoencoding (Dai and Le, 2015), and machine translation (McCann et al., 2017). BERT (Devlin et al., 2019) introduces a variation on this in which the goal is to predict the next sentence in a document given the current sentence. Our pre-training objective is similar in spirit, but operates at a conversation level, rather than a document level. We hence view our objective as conversational modeling rather than (only) language modeling. Furthermore, while BERT's sentence prediction objective is framed as a multiple-choice task, our objective is framed as a generative task.
3 Derailment Datasets

We consider two datasets, representing related but slightly different forecasting tasks. The first dataset is an expanded version of the annotated Wikipedia conversations dataset from Zhang et al. (2018a). This dataset uses carefully-controlled crowdsourced labels, strictly filtered to ensure the conversations are civil up to the moment of a personal attack. This is a useful property for the purposes of model analysis, and hence we focus on this as our primary dataset. However, we are conscious of the possibility that these strict labels may not fully capture the kind of behavior that moderators care about in practice. We therefore introduce a secondary dataset, constructed from the subreddit ChangeMyView (CMV), that does not use post-hoc annotations. Instead, the prediction task is to forecast whether the conversation will be subject to moderator action in the future.
Wikipedia data.
Zhang et al.'s 'Conversations Gone Awry' dataset consists of 1,270 conversations that took place between Wikipedia editors on publicly accessible talk pages. The conversations are sourced from the WikiConv dataset (Hua et al., 2018) and labeled by crowdworkers as either containing a personal attack from within (i.e., hostile behavior by one user in the conversation directed towards another) or remaining civil throughout. A series of controls are implemented to prevent models from picking up on trivial correlations. To prevent models from capturing topic-specific information (e.g., political conversations are more likely to derail), each attack-containing conversation is paired with a clean conversation from the same talk page, where the talk page serves as a proxy for topic. To force models to actually capture conversational dynamics rather than detecting already-existing toxicity, human annotations are used to ensure that all comments preceding a personal attack are civil.

To the ends of more effective model training, we elected to expand the 'Conversations Gone Awry' dataset, using the original annotation procedure. Since we found that the original data skewed towards shorter conversations, we focused this crowdsourcing run on longer conversations: ones with 4 or more comments preceding the attack. Through this additional crowdsourcing, we expand the dataset to 4,188 conversations, which we are publicly releasing as part of the Cornell Conversational Analysis Toolkit (ConvoKit).

We perform an 80-20-20 train/dev/test split, ensuring that paired conversations end up in the same split in order to preserve the topic control. Finally, we randomly sample another 1 million conversations from WikiConv to use for the unsupervised pre-training of the generative component.
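The pair-preserving split described above can be sketched as follows. This is an illustrative sketch, not the released code: the function name, the pair representation, and the exact split proportions are assumptions, and the published ConvoKit data ships with its own canonical split.

```python
import random

def pair_preserving_split(pairs, train_frac=0.8, seed=42):
    """Split (attack_convo, clean_convo) pairs into train/dev/test so that
    paired conversations always land in the same split, preserving the
    topic control. Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    n_dev = (len(pairs) - n_train) // 2
    train = pairs[:n_train]
    dev = pairs[n_train:n_train + n_dev]
    test = pairs[n_train + n_dev:]
    # Flatten the pairs back into individual conversations per split.
    flatten = lambda ps: [conv for pair in ps for conv in pair]
    return flatten(train), flatten(dev), flatten(test)
```

Splitting at the pair level (rather than the conversation level) is what guarantees that a derailing conversation and its topic-matched civil counterpart never end up in different splits.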
Reddit CMV data.
The CMV dataset is constructed from conversations collected via the Reddit API. In contrast to the Wikipedia-based dataset, we explicitly avoid the use of post-hoc annotation. Instead, we use as our label whether a conversation eventually had a comment removed by a moderator for violation of Rule 2: "Don't be rude or hostile to other users".

Footnote: The existence of this specific rule, the standardized moderation messages and the civil character of the ChangeMyView subreddit was our initial motivation for choosing it.
Footnote: Paired conversations were also enforced to be similar in length, so that the length distribution is the same between classes.
Footnote: We cap the length at 10 to avoid overwhelming the crowdworkers.
Footnote: convokit.cornell.edu

Though the lack of post-hoc annotation limits the degree to which we can impose controls on the data (e.g., some conversations may contain toxic comments not flagged by the moderators), we do reproduce as many of the Wikipedia data's controls as we can. Namely, we replicate the topic
I agree
Utt. Encoder
I don’t
Utt. EncoderUtt. EncoderContext Encoder Predictor MLP
Please explain
Decoder p event Generative (pre-training) objective Prediction objectiveComment 2 Comment 3Comment 1
Figure 2: Sketch of the CRAFT architecture. control pairing by choosing pairs of positive andnegative examples that belong to the same top-level post, following Tan et al. (2016); and en-force that the removed comment was made by auser who was previously involved in the conversa-tion. This process results in 6,842 conversations,to which we again apply a pair-preserving 80-20-20 split. Finally, we gather over 600,000 conver-sations that do not include any removed comment,for unsupervised pre-training.
4 Conversational Forecasting Model

We now describe our general model for forecasting future conversational events. Our model integrates two components: (a) a generative dialog model that learns to represent conversational dynamics in an unsupervised fashion; and (b) a supervised component that fine-tunes this representation to forecast future events. Figure 2 provides an overview of the proposed architecture, henceforth CRAFT (Conversational Recurrent Architecture for ForecasTing).
Terminology.
For modeling purposes, we treat a conversation as a sequence of N comments C = {c_1, ..., c_N}. Each comment, in turn, is a sequence of tokens, where the number of tokens may vary from comment to comment. For the n-th comment (1 ≤ n ≤ N), we let M_n denote the number of tokens. Then, a comment c_n can be represented as a sequence of M_n tokens: c_n = {w_1, ..., w_{M_n}}.

Footnote: The top-level post is not part of the conversations.
Footnote: We also impose the same length restriction on the number of comments preceding the removed comment, for comparability and for computational considerations.
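In code, this notation amounts to a list of token lists; a toy example (the comment text here is made up, not drawn from either dataset):

```python
# A conversation C = {c_1, ..., c_N}: a list of comments, where each
# comment c_n is a list of M_n tokens.
conversation = [
    ["what", "does", "this", "refer", "to", "?"],        # c_1, M_1 = 6
    ["no", ".", "it", "refers", "to", "the", "crisis"],  # c_2, M_2 = 7
]
N = len(conversation)                # number of comments
M = [len(c) for c in conversation]   # per-comment token counts M_n
assert N == 2 and M == [6, 7]
```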
Generative component.
For the generative component of our model, we use a hierarchical recurrent encoder-decoder (HRED) architecture (Sordoni et al., 2015a), a modified version of the popular sequence-to-sequence (seq2seq) architecture (Sutskever et al., 2014) designed to account for dependencies between consecutive inputs. Serban et al. (2016) showed that HRED can successfully model conversational context by encoding the temporal structure of previously seen comments, making it an ideal fit for our use case. Here, we provide a high-level summary of the HRED architecture, deferring deeper technical discussion to Sordoni et al. (2015a) and Serban et al. (2016).

An HRED dialog model consists of three components: an utterance encoder, a context encoder, and a decoder. The utterance encoder is responsible for generating semantic vector representations of comments. It consists of a recurrent neural network (RNN) that reads a comment token-by-token, and on each token w_m updates a hidden state h^enc based on the current token and the previous hidden state:

    h^enc_m = f_RNN(h^enc_{m-1}, w_m)    (1)

where f_RNN is a nonlinear gating function (our implementation uses GRU (Cho et al., 2014)). The final hidden state h^enc_{M_n} can be viewed as a vector encoding of the entire comment.

Running the encoder on each comment c_n results in a sequence of N vector encodings. A second encoder, the context encoder, is then run over this sequence:

    h^con_n = f_RNN(h^con_{n-1}, h^enc_{M_n})    (2)

Each hidden state h^con_n can then be viewed as an encoding of the full conversational context up to and including the n-th comment. To generate a response to comment n, the context encoding h^con_n is used to initialize the hidden state of a decoder RNN.
The decoder produces a response token by token using the following recurrence:

    h^dec_t = f_RNN(h^dec_{t-1}, w_{t-1})
    w_t = f_out(h^dec_t)    (3)

where f_out is some function that outputs a probability distribution over words; we implement this using a simple feedforward layer. In our implementation, we further augment the decoder with attention (Bahdanau et al., 2014; Luong et al., 2015) over context encoder states to help capture long-term inter-comment dependencies. This generative component can be pre-trained using unlabeled conversational data.

Prediction component.
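The hierarchical encoding of equations (1) and (2) can be sketched in plain NumPy. This is a minimal illustration of the two-level GRU recurrence, not the paper's PyTorch implementation: all function names are hypothetical, parameters are random, and the decoder, attention, and training are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(in_dim, hid_dim, rng, scale=0.1):
    # One (W, U, b) triple per gate: update z, reset r, candidate n.
    return {g: (rng.standard_normal((hid_dim, in_dim)) * scale,
                rng.standard_normal((hid_dim, hid_dim)) * scale,
                np.zeros(hid_dim))
            for g in ("z", "r", "n")}

def gru_step(h, x, p):
    """One GRU update (Cho et al., 2014): h_new = f_RNN(h, x)."""
    Wz, Uz, bz = p["z"]; Wr, Ur, br = p["r"]; Wn, Un, bn = p["n"]
    z = sigmoid(Wz @ x + Uz @ h + bz)        # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)        # reset gate
    n = np.tanh(Wn @ x + r * (Un @ h) + bn)  # candidate state
    return (1 - z) * h + z * n

def encode_comment(tokens, emb, p_utt, H):
    # Utterance encoder, eq. (1): read the comment token-by-token;
    # the final hidden state encodes the whole comment.
    h = np.zeros(H)
    for w in tokens:
        h = gru_step(h, emb[w], p_utt)
    return h

def encode_context(conversation, emb, p_utt, p_con, H):
    # Context encoder, eq. (2): run over the comment encodings; each
    # state h^con_n encodes the conversation up to comment n.
    h_con, states = np.zeros(H), []
    for comment in conversation:
        h_con = gru_step(h_con, encode_comment(comment, emb, p_utt, H), p_con)
        states.append(h_con)
    return states
```

Even with random weights, the context states are order-sensitive: encoding the same comments in a different order yields a different final state, which is the property the analysis section later probes.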
Given a pre-trained HRED dialog model, we aim to extend the model to predict from the conversational context whether the to-be-forecasted event will occur. Our predictor consists of a multilayer perceptron (MLP) with 3 fully-connected layers, leaky ReLU activations between layers, and sigmoid activation for output. For each comment c_n, the predictor takes as input the context encoding h^con_n and forwards it through the MLP layers, resulting in an output score that is interpreted as a probability p_event(c_{n+1}) that the to-be-forecasted event will happen (e.g., that the conversation will derail).

Training the predictive component starts by initializing the weights of the encoders to the values learned in pre-training. The main training loop then works as follows: for each positive sample—i.e., a conversation containing an instance of the to-be-forecasted event (e.g., derailment) at comment c_e—we feed the context c_1, ..., c_{e-1} through the encoder and classifier, and compute cross-entropy loss between the classifier output and expected output of 1. Similarly, for each negative sample—i.e., a conversation where none of the comments exhibit the to-be-forecasted event and that ends with c_N—we feed the context c_1, ..., c_{N-1} through the model and compute loss against an expected output of 0.

Note that the parameters of the generative component are not held fixed during this process; instead, backpropagation is allowed to go all the way through the encoder layers. This process, known as fine-tuning, reshapes the representation learned during pre-training to be more directly useful to prediction (Howard and Ruder, 2018).

We implement the model and training code using PyTorch, and we are publicly releasing our implementation and the trained models together with the data as part of ConvoKit.

5 Forecasting Derailment

We evaluate the performance of CRAFT in the task of forecasting conversational derailment in both the Wikipedia and CMV scenarios.
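The predictor head and its training signal can be sketched as below. This is a forward-pass illustration only (hypothetical names, NumPy instead of PyTorch); the actual fine-tuning additionally backpropagates the loss through the encoder layers.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def predictor_mlp(h_con, layers):
    """3-layer MLP head: maps a context encoding h^con_n to p_event,
    the forecast probability that derailment follows. `layers` is a
    list of (W, b) pairs; leaky ReLU between layers, sigmoid output."""
    h = h_con
    for W, b in layers[:-1]:
        h = leaky_relu(W @ h + b)
    W, b = layers[-1]
    return (1.0 / (1.0 + np.exp(-(W @ h + b)))).item()

def bce_loss(p, label):
    # Binary cross-entropy between forecast p and label (1 = derails):
    # positive samples are trained toward 1, negative samples toward 0.
    eps = 1e-12
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))
```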
To this end, for each of these datasets we pre-train the generative component on the unlabeled portion of the data and fine-tune it on the labeled training split (data size detailed in Section 3).

In order to evaluate our sequential system against conversation-level ground truth, we need to aggregate comment-level predictions. If any comment in the conversation triggers a positive prediction—i.e., p_event(c_{n+1}) is greater than a threshold learned on the development split—then the respective conversation is predicted to derail. If this forecast is triggered in a conversation that actually derails, but before the derailment actually happens, then the conversation is counted as a true positive; otherwise it is a false positive. If no positive predictions are triggered for a conversation, but it actually derails, then it counts as a false negative; if it does not derail then it is a true negative.
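The aggregation rule above can be written as a small function. This is a sketch of the evaluation protocol as described; the function name and argument conventions are assumptions.

```python
def aggregate_forecasts(comment_probs, threshold, derails_at=None):
    """Turn per-comment forecast probabilities into a conversation-level
    outcome. comment_probs: p_event after each comment, in order.
    derails_at: index of the first attack comment, or None if the
    conversation stays civil. Returns "TP", "FP", "FN", or "TN"."""
    for i, p in enumerate(comment_probs):
        if p > threshold:
            if derails_at is not None and i < derails_at:
                return "TP"   # forecast fired before the derailment
            return "FP"       # fired on a civil conversation, or too late
    return "FN" if derails_at is not None else "TN"
```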
We first seek to compare CRAFT to existing, fixed-length window approaches to forecasting. To this end, we implement two such baselines:
Awry, which is the state-of-the-art method proposed in Zhang et al. (2018a) based on pragmatic features in the first comment-reply pair, and BoW, a simple bag-of-words baseline that makes a prediction using TF-IDF weighted bag-of-words features extracted from the first comment-reply pair.
Online forecasting baselines.
Next, we consider simpler approaches for making forecasts as the conversations happen (i.e., in an online fashion). First, we propose
Cumulative BoW, a model that recomputes bag-of-words features on all comments seen thus far every time a new comment arrives. While this approach does exhibit the desired behavior of producing updated predictions for each new comment, it fails to account for relationships between comments.

This simple cumulative approach cannot be directly extended to models whose features are strictly based on a fixed number of comments, like Awry. An alternative is to use a sliding window: for a feature set based on a window of W comments, upon each new comment we can extract features from a window containing that comment and the W − 1 comments preceding it. We apply this to the Awry method and call this model Sliding Awry. For both these baselines, we aggregate comment-level predictions in the same way as in our main model.
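The two online baseline schemes can be sketched as follows. This is a simplified illustration: the real Cumulative BoW baseline uses TF-IDF weighting rather than the raw counts shown here, and the window features for Sliding Awry are the pragmatic features of the Awry method, not the comment texts themselves.

```python
from collections import Counter

def cumulative_bow(comments_so_far):
    """Recompute bag-of-words features over all comments seen so far;
    called afresh each time a new comment arrives."""
    feats = Counter()
    for comment in comments_so_far:
        feats.update(comment.lower().split())
    return feats

def sliding_windows(comments, W=2):
    """Yield, for each new comment, the window of that comment plus the
    W - 1 preceding ones, as used by the Sliding Awry baseline."""
    for i in range(len(comments)):
        yield comments[max(0, i - W + 1): i + 1]
```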
CRAFT ablations.
Finally, we consider two modified versions of the CRAFT model in order to evaluate the impact of two of its key components: (1) the pre-training step, and (2) its ability to capture inter-comment dependencies through its hierarchical memory.

To evaluate the impact of pre-training, we train the prediction component of CRAFT on only the labeled training data, without first pre-training the encoder layers with the unlabeled data. We find that given the relatively small size of labeled data, this baseline fails to successfully learn, and ends up performing at the level of random guessing. This result underscores the need for the pre-training step that can make use of unlabeled data.

To evaluate the impact of the hierarchical memory, we implement a simplified version of CRAFT where the memory size of the context encoder is zero (CRAFT − CE), thus effectively acting as if the pre-training component is a vanilla seq2seq model. In other words, this model cannot capture inter-comment dependencies, and instead at each step makes a prediction based only on the utterance encoding of the latest comment.

Footnote: We use the ConvoKit implementation.

Model | D O L | Wikipedia Talk Pages: A / P / R / FPR / F1 | Reddit CMV: A / P / R / FPR / F1
BoW   |       | 56.5 / 55.6 / 65.5 / 52.4 / 60.1 | 52.1 / 51.8 / 61.3 / 57.0 / 56.1
[Remaining rows, including Awry, CRAFT − CE, and CRAFT, are not recoverable from the extracted text.]

Table 1: Comparison of the capabilities of each baseline and our CRAFT models (full and without the Context Encoder) with regards to capturing inter-comment (D)ynamics, processing conversations in an (O)nline fashion, and automatically (L)earning feature representations, as well as their performance in terms of (A)ccuracy, (P)recision, (R)ecall, False Positive Rate (FPR), and F1 score. Awry is the model previously proposed by Zhang et al. (2018a) for this task.

Results.
Table 1 compares CRAFT to the baselines on the test splits (random baseline is 50%) and illustrates several key findings. First, we find that, unsurprisingly, accounting for full conversational context is indeed helpful, with even the simple online baselines outperforming the fixed-window baselines. On both datasets, CRAFT outperforms all baselines (including the other online models) in terms of accuracy and F1. Furthermore, although it loses on precision (to CRAFT − CE) and recall (to Cumulative BoW) individually on the Wikipedia data, CRAFT has the superior balance between the two, having both a visibly higher precision-recall curve and larger area under the curve (AUPR) than the baselines (Figure 3). This latter property is particularly useful in a practical setting, as it allows moderators to tune model performance to some desired precision without having to sacrifice as much in the way of recall (or vice versa) compared to the baselines and pre-existing solutions.

Footnote (re: the no-pre-training ablation): We thus exclude this baseline from the results summary.

Figure 3: Precision-recall curves and the area under each curve. To reduce clutter, we show only the curves for Wikipedia data (CMV curves are similar) and exclude the fixed-length window baselines (which perform worse).
6 Analysis

We now examine the behavior of CRAFT in greater detail, to better understand its benefits and limitations. We specifically address the following questions: (1) How much early warning does the model provide? (2) Does the model actually learn an order-sensitive representation of conversational context?

Figure 4: Distribution of number of comments elapsed between the model's first warning and the attack.

Early warning, but how early?
The recent interest in forecasting antisocial behavior has been driven by a desire to provide pre-emptive, actionable warning to moderators. But does our model trigger early enough for any such practical goals?

For each personal attack correctly forecasted by our model, we count the number of comments elapsed between the time the model is first triggered and the attack. Figure 4 shows the distribution of these counts: on average, the model warns of an attack 3 comments before it actually happens (4 comments for CMV). To further evaluate how much time this early warning would give to the moderator, we also consider the difference in timestamps between the comment where the model first triggers and the comment containing the actual attack. Over 50% of conversations get at least 3 hours of advance warning (2 hours for CMV). Moreover, 39% of conversations get at least 12 hours of early warning before they derail.
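The lead-time statistic above can be computed per conversation as follows; the function name and argument conventions are hypothetical, and the same code works whether the inputs are comment indices (lead in comments) or timestamps (lead in seconds).

```python
def warning_lead(trigger_times, attack_time):
    """Lead between the model's first trigger and the attack: the gap
    from the earliest trigger strictly before the attack, or None if
    the model never fired in time (i.e., the attack was not forecast)."""
    fired = [t for t in trigger_times if t < attack_time]
    return (attack_time - min(fired)) if fired else None
```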
Does order matter?
One motivation behind the design of our model was the intuition that comments in a conversation are not independent events; rather, the order in which they appear matters (e.g., a blunt comment followed by a polite one feels intuitively different from a polite comment followed by a blunt one). By design, CRAFT has the capacity to learn an order-sensitive representation of conversational context, but how can we know that this capacity is actually used? It is conceivable that the model is simply computing an order-insensitive "bag-of-features". Neural network models are notorious for their lack of transparency, precluding an analysis of how exactly CRAFT models conversational context. Nevertheless, through two simple exploratory experiments, we seek to show that it does not completely ignore comment order.

Footnote: We choose to focus on the Wikipedia scenario since the conversational prefixes are hand-verified to be civil. For completeness we also report results for Reddit CMV throughout, but they should be taken with an additional grain of salt.

Figure 5: The prefix-shuffling procedure (t = 4).

The first experiment for testing whether the model accounts for comment order is a prefix-shuffling experiment, visualized in Figure 5. For each conversation that the model predicts will derail, let t denote the index of the triggering comment, i.e., the index where the model first made a derailment forecast. We then construct synthetic conversations by taking the first t − 1 comments (henceforth referred to as the prefix) and randomizing their order. Finally, we count how often the model no longer predicts derailment at index t in the synthetic conversations. If the model were ignoring comment order, its prediction should remain unchanged (as it remains for the Cumulative BoW baseline), since the actual content of the first t comments has not changed (and CRAFT inference is deterministic). We instead find that in roughly one fifth of cases (12% for CMV) the model changes its prediction on the synthetic conversations. This suggests that CRAFT learns an order-sensitive representation of context, not a mere "bag-of-features".

To more concretely quantify how much this order-sensitive context modeling helps with prediction, we can actively prevent the model from learning and exploiting any order-related dynamics. We achieve this through another type of shuffling experiment, where we go back even further and shuffle the comment order in the conversations used for pre-training, fine-tuning and testing. This procedure preserves the model's ability to capture signals present within the individual comments processed so far, as the utterance encoder is unaffected, but inhibits it from capturing any meaningful order-sensitive dynamics.
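The prefix-shuffling probe can be sketched as below. Note the indexing convention is an assumption of this sketch: here t is the 0-based index of the triggering comment, so the prefix conversation[:t] corresponds to the comments before the trigger, and reordering requires at least two of them.

```python
import random

def shuffled_prefix_variants(conversation, t, k=10, seed=0):
    """Prefix-shuffling probe: given a conversation where the model
    first triggers at (0-based) index t, produce k synthetic
    conversations with the prefix (the comments before the trigger)
    randomly reordered and the rest of the conversation kept intact."""
    assert t >= 2, "a one-comment prefix cannot be reordered"
    rng = random.Random(seed)
    variants = []
    for _ in range(k):
        prefix = list(conversation[:t])  # copy, so shuffling is safe
        rng.shuffle(prefix)
        variants.append(prefix + list(conversation[t:]))
    return variants
```

Running the model on each variant and checking whether the forecast at index t flips is then a direct test of order sensitivity, since the shuffled variants contain exactly the same comment content as the original.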
(We restrict the prefix-shuffling experiment to cases where t ≥ 3, as prefixes consisting of only one comment cannot be reordered.) We find that this hurts the model's performance (65% accuracy for Wikipedia, 59.5% for CMV), lowering it to a level similar to that of the version where we completely disable the context encoder.

Taken together, these experiments provide evidence that CRAFT uses its capacity to model conversational context in an order-sensitive fashion, and that it makes effective use of the dynamics within. An important avenue for future work would be developing more transparent models that can shed light on exactly what kinds of order-related features are being extracted and how they are used in prediction.

In this work, we introduced a model for forecasting conversational events that processes comments as they happen and takes the full conversational context into account to make an updated prediction at each step. This model fills a void in the existing literature on conversational forecasting, simultaneously addressing the dual challenges of capturing inter-comment dynamics and dealing with an unknown horizon. We find that our model achieves state-of-the-art performance on the task of forecasting derailment in two different datasets that we release publicly. We further show that the resulting system can provide substantial prior notice of derailment, opening up the potential for preemptive interventions by human moderators (Seering et al., 2017).

While we have focused specifically on the task of forecasting derailment, we view this work as a step towards a more general model for real-time forecasting of other types of emergent properties of conversations. Follow-up work could adapt the CRAFT architecture to address other forecasting tasks mentioned in Section 2, including those for which the outcome is extraneous to the conversation.
We expect different tasks to be informed by different types of inter-comment dynamics, and further architecture extensions could add additional supervised fine-tuning in order to direct the model to focus on specific dynamics that might be relevant to the task (e.g., exchange of ideas between interlocutors, or stonewalling).

With respect to forecasting derailment, there remain open questions regarding what human moderators actually desire from an early-warning system, which would affect the design of a practical system based on this work. For instance, how early does a warning need to be in order for moderators to find it useful? What is the optimal balance between precision, recall, and false positive rate at which such a system truly improves moderator productivity rather than wasting their time through false positives? What are the ethical implications of such a system? Follow-up work could run a user study of a prototype system with actual moderators to address these questions.

A practical limitation of the current analysis is that it relies on balanced datasets, while derailment is a relatively rare event for which a more restrictive trigger threshold would be appropriate. While our analysis of the precision-recall curve suggests the system is robust across multiple thresholds (AUPR = 0. ), additional work is needed to establish whether the recall tradeoff would be acceptable in practice.

Finally, one major limitation of the present work is that it assigns a single label to each conversation: does it derail or not? In reality, derailment need not spell the end of a conversation; it is possible that a conversation could get back on track, suffer a repeat occurrence of antisocial behavior, or follow any number of other trajectories. It would be exciting to consider finer-grained forecasting of conversational trajectories, accounting for the natural, and sometimes chaotic, ebb-and-flow of human interactions.

Acknowledgements.
We thank Caleb Chiam, Liye Fu, Lillian Lee, Alexandru Niculescu-Mizil, Andrew Wang and Justine Zhang for insightful conversations (with unknown horizon), Aditya Jha for his great help with implementing and running the crowd-sourcing tasks, Thomas Davidson and Claire Liang for exploratory data annotation, as well as the anonymous reviewers for their helpful comments. This work is supported in part by the NSF CAREER award IIS-1750615 and by the NSF Grant SES-1741441.
References

Rob Abbott, Marilyn Walker, Pranav Anand, Jean E. Fox Tree, Robeson Bowmani, and Joseph King. 2011. How Can You Say Such Things?!?: Recognizing Disagreement in Informal Political Argument. In Proceedings of the Workshop on Languages in Social Media.
Yavuz Akbulut, Yusuf Levent Sahin, and Bahadir Eristi. 2010. Cyberbullying Victimization among Turkish Online Social Utility Members. Educational Technology & Society, 13(4).
Kelsey Allen, Giuseppe Carenini, and Raymond T. Ng. 2014. Detecting Disagreement in Conversations using Pseudo-Monologic Rhetorical Structure. In Proceedings of EMNLP.
Tim Althoff, Kevin Clark, and Jure Leskovec. 2016. Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health. Transactions of the Association for Computational Linguistics, 4.
Ofer Arazy, Lisa Yeo, and Oded Nov. 2013. Stay on the Wikipedia Task: When Task-related Disagreements Slip Into Personal and Procedural Conflicts. Journal of the American Society for Information Science and Technology, 64(8).
Lars Backstrom, Jon Kleinberg, Lillian Lee, and Cristian Danescu-Niculescu-Mizil. 2013. Characterizing and Curating Conversation Threads: Expansion, Focus, Volume, Re-entry. In Proceedings of WSDM.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.
Anais Cadilhac, Nicholas Asher, Farah Benamara, and Alex Lascarides. 2013. Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game. In Proceedings of EMNLP.
Justin Cheng, Michael Bernstein, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2017. Anyone Can Become a Troll: Causes of Trolling Behavior in Online Discussions. In Proceedings of CSCW.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of EMNLP.
Benjamin Collier and Julia Bear. 2012. Conflict, Criticism, or Confidence: An Empirical Examination of the Gender Gap in Wikipedia Contributions. In Proceedings of CSCW.
Jared R. Curhan and Alex Pentland. 2007. Thin Slices of Negotiation: Predicting Outcomes From Conversational Dynamics Within the First 5 Minutes. Journal of Applied Psychology, 92.
Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised Sequence Learning. In Proceedings of NeurIPS.
Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of ICWSM.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL.
In Proceedings of KDD.
Michel Galley, Kathleen McKeown, Julia Hirschberg, and Elizabeth Shriberg. 2004. Identifying Agreement and Disagreement in Conversational Speech: Use of Bayesian Networks to Model Pragmatic Dependencies. In Proceedings of ACL.
Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural Approaches to Conversational AI. In Proceedings of SIGIR.
Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, and Michael Mathioudakis. 2017. Quantifying Controversy in Social Media. ACM Transactions on Social Computing, 1(1).
Codruta Girlea, Roxana Girju, and Eyal Amir. 2016. Psycholinguistic Features for Deceptive Role Detection in Werewolf. In Proceedings of NAACL.
Jack Hessel and Lillian Lee. 2019. Something's Brewing! Early Prediction of Controversy-causing Posts from Discussion Features. In Proceedings of NAACL.
Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL.
Yiqing Hua, Cristian Danescu-Niculescu-Mizil, Dario Taraborelli, Nithum Thain, Jeffery Sorensen, and Lucas Dixon. 2018. WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community. In Proceedings of EMNLP.
Yohan Jo, Shivani Poddar, Byungsoo Jeon, Qinlan Shen, Carolyn P. Rosé, and Graham Neubig. 2018. Attentive Interaction Model: Modeling Changes in View in Argumentation. In Proceedings of NAACL.
David Jurgens, Libby Hemphill, and Eshwar Chandrasekharan. 2019. A Just and Comprehensive Strategy for Using NLP to Address Online Abuse. In Proceedings of ACL.
Joseph M. Kayany. 1998. Contexts of uninhibited online behavior: Flaming in social newsgroups on usenet. Journal of the American Society for Information Science, 49(12).
Srijan Kumar, William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2018. Community Interaction and Conflict on the Web. In Proceedings of WWW.
Sarah Ita Levitan, Angel Maredia, and Julia Hirschberg. 2018. Linguistic Cues to Deception and Perceived Deception in Interview Dialogues. In Proceedings of NAACL.
Ping Liu, Joshua Guberman, Libby Hemphill, and Aron Culotta. 2018. Forecasting the Presence and Intensity of Hostility on Instagram Using Linguistic and Social Features. In Proceedings of ICWSM.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of EMNLP.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in Translation: Contextualized Word Vectors. In Proceedings of NeurIPS.
Vlad Niculae and Cristian Danescu-Niculescu-Mizil. 2016. Conversational Markers of Constructive Discussions. In Proceedings of NAACL.
Vlad Niculae, Srijan Kumar, Jordan Boyd-Graber, and Cristian Danescu-Niculescu-Mizil. 2015. Linguistic Harbingers of Betrayal: A Case Study on an Online Strategy Game. In Proceedings of ACL.
Alexandra Olteanu, Carlos Castillo, Jeremy Boy, and Kush Varshney. 2018. The Effect of Extremist Violence on Hateful Speech Online. In Proceedings of ICWSM.
Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing Gender Bias in Abusive Language Detection. In Proceedings of EMNLP.
John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2017. Deeper Attention to Abusive User Content Moderation. In Proceedings of EMNLP.
Verónica Pérez-Rosas, Mohamed Abouelenien, Rada Mihalcea, Yao Xiao, C. J. Linton, and Mihai Burzo. 2016. Verbal and Nonverbal Clues for Real-life Deception Detection. In Proceedings of EMNLP.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of NAACL.
Peter Potash and Anna Rumshisky. 2017. Towards Debate Automation: A Recurrent Model for Predicting Debate Winners. In Proceedings of EMNLP.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-training. Technical report, OpenAI.
Juliana Raskauskas and Ann D. Stoltz. 2007. Involvement in Traditional and Electronic Bullying Among Adolescents. Developmental Psychology, 43(3).
Manoel Horta Ribeiro, Pedro H. Calais, Yuri A. Santos, Virgílio A. F. Almeida, and Wagner Meira Jr. 2018. Characterizing and Detecting Hateful Users on Twitter. In Proceedings of ICWSM.
Sara Rosenthal and Kathleen McKeown. 2015. I Couldn't Agree More: The Role of Conversational Structure in Agreement and Disagreement Detection in Online Discussions. In Proceedings of SIGDIAL.
Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The Risk of Racial Bias in Hate Speech Detection. In Proceedings of ACL.
Joseph Seering, Robert Kraut, and Laura Dabbish. 2017. Shaping Pro and Anti-Social Behavior on Twitch Through Moderation and Example-Setting. In Proceedings of CSCW.
Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of AAAI.
Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proceedings of AAAI.
Vivek K. Singh, Marie L. Radford, Qianjia Huang, and Susan Furrer. 2017. "They basically like destroyed the school one day": On Newer App Features and Cyberbullying in Schools. In Proceedings of CSCW.
Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015a. A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. In Proceedings of CIKM.
Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015b. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In Proceedings of NAACL.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NeurIPS.
Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions. In Proceedings of WWW.
Jessica Vitak, Kalyani Chadha, Linda Steiner, and Zahra Ashktorab. 2017. Identifying Women's Experiences With and Strategies for Mitigating Negative Effects of Online Harassment. In Proceedings of CSCW.
Svitlana Volkova and Eric Bell. 2017. Identifying Effective Signals to Predict Deleted and Suspended Accounts on Twitter across Languages. In Proceedings of ICWSM.
Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the Best Counterargument without Prior Topic Knowledge. In Proceedings of ACL.
Lu Wang, Nick Beauchamp, Sarah Shugars, and Kechen Qin. 2017. Winning on the Merits: The Joint Effects of Content and Style on Debate Outcomes. Transactions of the Association for Computational Linguistics, 5.
Lu Wang and Claire Cardie. 2014. A Piece of My Mind: A Sentiment Analysis Approach for Online Dispute Detection. In Proceedings of ACL.
William Warner and Julia Hirschberg. 2012. Detecting Hate Speech on the World Wide Web. In Proceedings of the Second Workshop on Language in Social Media.
Michael Wiegand, Josef Ruppenhofer, and Thomas Kleinbauer. 2019. Detection of Abusive Language: The Problem of Biased Datasets. In Proceedings of NAACL.
Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex Machina: Personal Attacks Seen at Scale. In Proceedings of WWW.
Diyi Yang, Jiaao Chen, Zichao Yang, Dan Jurafsky, and Eduard Hovy. 2019. Let's Make Your Request More Persuasive: Modeling Persuasive Strategies via Semi-Supervised Neural Nets on Crowdfunding Platforms. In Proceedings of NAACL.
Dawei Yin, Zhenzhen Xue, and Liangjie Hong. 2009. Detection of Harassment on Web 2.0. In Proceedings of CAW2.0.
Justine Zhang, Jonathan P. Chang, Cristian Danescu-Niculescu-Mizil, Lucas Dixon, Nithum Thain, Yiqing Hua, and Dario Taraborelli. 2018a. Conversations Gone Awry: Detecting Early Signs of Conversational Failure. In Proceedings of ACL.
Justine Zhang, Cristian Danescu-Niculescu-Mizil, Christina Sauper, and Sean J. Taylor. 2018b. Characterizing Online Public Discussions Through Patterns of Participant Interactions. In Proceedings of CSCW.
Justine Zhang, Ravi Kumar, Sujith Ravi, and Cristian Danescu-Niculescu-Mizil. 2016. Conversational Flow in Oxford-style Debates. In