Conversations Gone Alright: Quantifying and Predicting Prosocial Outcomes in Online Conversations
Jiajun Bao, Junjie Wu, Yiming Zhang, Eshwar Chandrasekharan, David Jurgens
Jiajun Bao ∗ [email protected] Mellon University,Language Technologies Institute Junjie Wu ∗ [email protected] Hong Kong University of Scienceand Technology Yiming Zhang ∗ [email protected] of Michigan Eshwar Chandrasekharan [email protected] of Illinois,Urbana-Champaign
David Jurgens [email protected] of Michigan, School ofInformation
ABSTRACT
Online conversations can go in many directions: some turn out poorly due to antisocial behavior, while others turn out positively to the benefit of all. Research on improving online spaces has focused primarily on detecting and reducing antisocial behavior. Yet we know little about positive outcomes in online conversations and how to increase them—is a prosocial outcome simply the lack of antisocial behavior or something more? Here, we examine how conversational features lead to prosocial outcomes within online discussions. We introduce a series of new theory-inspired metrics to define prosocial outcomes such as mentoring and esteem enhancement. Using a corpus of 26M Reddit conversations, we show that these outcomes can be forecasted from the initial comment of an online conversation, with the best model providing a relative 24% improvement over human forecasting performance at ranking conversations for predicted outcome. Our results indicate that platforms can use these early cues in their algorithmic ranking of early conversations to prioritize better outcomes.
CCS CONCEPTS
• Human-centered computing; • Social and professional topics; • Applied computing → Sociology

KEYWORDS
prosocial behavior, antisocial behavior, social media, behavioral forecasting
ACM Reference Format:
Jiajun Bao, Junjie Wu, Yiming Zhang, Eshwar Chandrasekharan, and David Jurgens. 2021. Conversations Gone Alright: Quantifying and Predicting Prosocial Outcomes in Online Conversations. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442381.3450122

∗ Authors contributed equally to this research.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. WWW '21, April 19–23, 2021, Ljubljana, Slovenia. © 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-8312-7/21/04. https://doi.org/10.1145/3442381.3450122
Post: "Studied like crazy all last week and finally took a Technician course and tested on Saturday. Boom, I'm legal! Thanks for all of the support I've got on /r/amateurradio!!"
⌞ Reply 1: That test is a killer!
⌞ Reply 2: Great job! Hope to hear you on the air some day
Figure 1: Which reply is likely to lead to a positive, productive conversation? Here, we introduce new metrics for measuring the prosocial qualities of social media discussions and develop new models to predict which of these replies will lead to a conversation with higher prosocial behavior.
1 INTRODUCTION

Interacting with others online has become a common facet of daily life. Yet, these interactions often can turn out poorly, in part due to toxic behavior on the part of others [28, 48]. Given the significant impact of experiencing these negative activities on well-being [34, 66, 67], substantial research effort has been put into detecting such toxic behavior and facilitating platform tools to remove it. However, despite sophisticated techniques for measuring antisocial behavior, the key metrics for prosocial behavior are relatively unknown, to the point that major platforms such as Twitter and Instagram have both called for researchers to develop such metrics [30, 74]. Here, we operationalize theories from social psychology to quantify and measure prosocial behavior in social media, showing a rich diversity in the types of behaviors, and then show that these positive outcomes can be forecasted from the early stages of a conversation.

The impact of online discussions on daily life and mental health has prompted multiple studies on online conversational dynamics. A significant effort has focused on detecting antisocial behaviors such as hate speech [79], trolling [21], or bullying [52], in attempts to mitigate their effect by offering moderators tools to find and remove them. Further, recent work has shown that the initial start of a conversation can forecast antisocial outcomes from early linguistic and behavioral information [19, 39, 52, 85]. However, only a handful of studies have examined prosocial behaviors, such as making constructive comments [45, 46, 59] or offering supportive messages [61, 77, 78]. Our work brings together these lines of research through a systematic examination of prosocial behaviors and building models to forecast these conversational outcomes.
We introduce eight types of prosocial metrics and develop new methods to forecast the prosocial trajectory of a conversation from its early interactions. Focusing this task on one of prediction is strongly motivated by the implications for real-world impact. Online platforms regularly engage in content reranking, where comments and threads are reordered according to internal objectives [14, 51]. Given the ability to predict the prosocial trajectory of a conversation, platforms can potentially rerank the initial comments to a post (or other comments) to emphasize those that will lead to better community experiences.

This paper offers the following three contributions. First, using theoretical insights from prior literature on prosocial and antisocial behavior in online and offline contexts (§2), we introduce a panel of prosocial metrics and construct a large-scale corpus of social media conversations labeled by these outcomes (§3). Using this corpus, we demonstrate that these metrics are significantly associated with human judgments of prosociality and show that prosociality is not just the absence of antisocial behavior. Second, we introduce four new models for forecasting the prosocial quality of a conversation (§5), showing that such outcomes can be accurately forecasted from cues early in the conversation. Third, given the first comment of two conversations, we demonstrate that both our models and people can forecast which conversation is likely to turn out better (§6), with our model offering a 24% improvement on human accuracy with respect to chance. Our work has implications for platforms' abilities to surface online interactions in order to create positive outcomes for individuals participating in them.
Prosocial behavior began as an antonym used by social scientists for describing the opposite of antisocial behavior [6, 44], with Mussen and Eisenberg-Berg defining the behavior as "voluntary actions that are intended to help or benefit another individual or group of individuals." Since this time, prosocial behavior has broadened to include a range of activities: helping, sharing, comforting, rescuing, and cooperating [9]. Our work examines prosocial behavior in online discussions by deriving a large cohort of candidate metrics for measuring conversations from theory and then testing which are associated with judgments of prosocial behavior online.

The concept of prosociality is complex, and the nuances of which aspects of behavior contribute to its perception online are not yet well understood. A few recent approaches have examined specific factors related to positive conversational outcomes like constructive comments [59, 60], politeness [24], supportiveness [78], or empathy [15, 70, 86], or have shown that, in general, online prosocial behaviors mirror offline trends [82]. In the majority of cases, only individual dimensions have been analyzed; however, we note that recent work has proposed studying these dimensions jointly in relationships and social interactions [22] using the ten social dimensions outlined in Deri et al. [25]. Similar to the present work, Choi et al. [22] examine general factors from sociological and psychological literature for relationships to study interactions; however, the factors used here are specifically grounded in prosocial literature and include behavioral factors in addition to linguistic factors. A few studies have tried to measure prosocial behavior as a single variable [32, 33]; however, in practice these approaches have used lexicons that recognize only a subset of the possible prosocial behaviors, focused on collective interest and interpersonal harmony.

Prosocial behaviors can take many forms depending on the parties involved and their needs.
Here, we identify eight broad categories of behavior, grounded in prior work from Social Psychology, that can be easily operationalized using NLP techniques and are either markers of direct prosocial behavior or behaviors that serve as a precursor to prosocial behavior. As many of these behaviors have been identified and studied in offline settings, our aim is to study how these behaviors are interpreted in online settings in order to curate a set of prosocial metrics that match peoples' readings of online interactions. Below, we outline each category, describe its connection to prior theory, and summarize how its behavior is recognized. Additional details for metrics and classifiers are provided in Appendices A.1–A.8.
Information Sharing.
Individuals seek out information online where others may provide suggestions. In some settings these efforts are codified around collaboratively creating information goods like Wikipedia or open source projects [71]. In social media like Reddit, responses to questions create persistent knowledge that others can learn from. This knowledge transfer may take the form of explicit information or references to websites such as Wikipedia or StackOverflow. Here, we operationalize these information-sharing behaviors in two ways. First, using information-providing subreddits (e.g., r/AskScience), we train a classifier to recognize informative replies to questions; then, as a prosocial metric, we use the classifier to identify and count these replies in a conversation—i.e., does a discussion lead to information sharing? Second, recognizing that URLs often serve as important sources of third-party information, we include counts for (i) information-based domains, e.g., wikipedia.org, and (ii) all other websites, recognizing that many domains may serve informative purposes (e.g., linking to a product review). More details about the training of the classifier can be found in Appendix A.1.
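As a sketch of the URL-based half of this metric, the following counts links to information-oriented domains versus all other domains across a conversation's replies. The domain list and the function name are illustrative stand-ins for the full list in Appendix A.1.

```python
import re
from urllib.parse import urlparse

# Illustrative subset; the paper's full domain list is in Appendix A.1.
INFO_DOMAINS = {"wikipedia.org", "stackoverflow.com"}
URL_RE = re.compile(r"https?://\S+")

def link_counts(replies):
    """Return (informational links, other links) over all reply texts."""
    info, other = 0, 0
    for reply in replies:
        for url in URL_RE.findall(reply):
            host = urlparse(url).netloc.lower()
            # match the registered domain, including its subdomains
            if any(host == d or host.endswith("." + d) for d in INFO_DOMAINS):
                info += 1
            else:
                other += 1
    return info, other
```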
Gratitude.
Gratitude serves an important function in fostering social relationships and promoting future reciprocity [8, 29]. Gratitude not only reinforces existing prosocial behavior, but also motivates more prosocial behavior and itself serves as an indicator that prosocial behavior has occurred [2, 55, 56]. Here, we identify gratitude through a fixed lexicon of phrases that signal gratitude (Appendix A.4), e.g., "thank you," and count how many times such phrases are used in a discussion.
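The gratitude metric can be sketched as a simple lexicon match over replies; the phrase list below is a small illustrative stand-in for the lexicon in Appendix A.4.

```python
import re

# Illustrative stand-in for the gratitude lexicon (Appendix A.4).
GRATITUDE_PHRASES = ["thank you", "thanks", "thx", "much appreciated", "grateful"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in GRATITUDE_PHRASES) + r")\b",
    re.IGNORECASE,
)

def gratitude_count(replies):
    """Total number of gratitude phrases used across all replies."""
    return sum(len(PATTERN.findall(r)) for r in replies)
```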
Esteem Enhancement.
Prosocial behavior is known to be motivated by a person's self-esteem [9]. Individuals may seek out opportunities to behave prosocially as a way to repair or improve their self-esteem [12]; improved esteem can increase the perception of reciprocity for help, further motivating prosocial actions. Thus, esteem-enhancing actions can serve as a useful behavior to monitor as precursors for prosocial behavior. Here, we measure esteem enhancement using three metrics. First, recognizing that politeness is often used to signal social status [13], we use a politeness regressor based in part on Danescu-Niculescu-Mizil et al. [24] to measure the average politeness of comment interactions (Appendix A.7); the underlying hypothesis is that more polite messages increase the status (esteem) perception of the recipient. Second, we identify all statements with second-person pronominal references (e.g., you) and count how many have strongly positive sentiment, which approximates identifying compliments (Appendix A.5). Third, we measure the total score given to responses in a discussion as a measure of esteem given by the community to the conversation. The score of a comment is closely correlated with the number of upvotes it receives (as derived through a proprietary measure), and receiving upvotes is known to be an esteem-enhancing action [16].
Social Support.
In times of distress, individuals turn to their social network for support [9, 80]. Online communities and platforms have provided a parallel support mechanism around many types of needs, such as physical and mental health [11, 57, 77] and weight loss. Moreover, beyond specific needs, individuals offer supportive messages in general on these platforms, e.g., encouragement [78]. Due to the diversity of support expected on Reddit, we develop a computational model to recognize supportive messages using the data of Wang and Jurgens (Appendix A.8) and use the average supportiveness of comments in a conversation.
Social Cohesion.
Social ties create a sense of community, which carries with it the benefits of group membership and altruistic behavior between members [1, 35]. Conversely, exclusion from a group or a weakening of ties decreases prosocial behavior [73]. Indeed, Prinstein and Cillessen note that helping someone join a conversation is a core prosocial behavior, and studies have shown that increased linguistic accommodation [62, 72] is associated with increased prosocial behavior [47] and trust [69].

To measure the formation of social bonds, we use four categories of metrics. First, building on the insight of Kulesza et al. [47], we measure linguistic accommodation between commenters using the methods of Danescu-Niculescu-Mizil et al. Second, recognizing that increased conversation gives rise to social ties, we include metrics for (i) the total number of participants in a conversation, (ii) the longest number of sustained turns between two people, and (iii) the depth of the conversation's comment tree. Third, laughter, as a function of humor, is known to create positive affect between peers and increase cohesion among group members [4, 36, 64]. Therefore, we count the number of laughter events in a conversation (see Appendix A.2 for details). Fourth, self-disclosure is known to strengthen social ties [27, 41, 42]; to measure disclosures, we follow prior work [5, 7, 40] and count the number of comments including a first-person pronoun.
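The structural metrics in the second category (participants, tree depth, and the longest sustained two-person exchange) can be sketched as a walk over the reply tree. The (comment_id, parent_id, author) tuple layout is an assumption for illustration, not the paper's actual data format.

```python
def conversation_metrics(comments):
    """comments: list of (comment_id, parent_id, author) tuples; the
    single top-level comment has parent_id None."""
    children, author = {}, {}
    for cid, parent, who in comments:
        children.setdefault(parent, []).append(cid)
        author[cid] = who

    n_participants = len(set(author.values()))
    best = {"depth": 0, "turns": 0}

    def walk(cid, depth, pair, turns):
        # track the deepest comment and the longest back-and-forth
        # run of turns between one pair of users
        best["depth"] = max(best["depth"], depth)
        best["turns"] = max(best["turns"], turns)
        for kid in children.get(cid, []):
            kid_pair = frozenset((author[cid], author[kid]))
            kid_turns = turns + 1 if kid_pair == pair else 2
            walk(kid, depth + 1, kid_pair, kid_turns)

    (root,) = children[None]  # single top-level comment assumed
    walk(root, 1, frozenset(), 1)
    return n_participants, best["depth"], best["turns"]
```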
Fundraising and Donating.
An offline prosocial behavior that readily transfers to online behavior is fundraising for charitable activities [71, 83]. Many online charities have websites set up to receive donations, and sites such as gofundme support more individualized fundraising activities. Here, we count the number of URLs to these types of sites using a list of popular charities, detailed in Appendix A.6.
Mentoring.
Individuals can give expertise through mentoring or advice when others are having a problem, which is a known prosocial behavior [65, 83]. To recognize advice giving and mentoring, we train a classifier to distinguish the language of advice in the responses of advice-based subreddits (e.g., /r/FashionAdvice, /r/RelationshipAdvice, /r/LegalAdvice) from responses in other subreddits. Then, we use this classifier to recognize and count the number of advice-based responses in a conversation thread. Further details on the classifier can be found in Appendix A.3.

Absence of Antisocial Behavior.
Is prosocial behavior implied by the absence of antisocial behavior? If true, this would suggest that platforms' efforts to find and remove antisocial behavior are sufficient for retaining and promoting prosocial behavior. To answer this question, we label all replies using the Perspective API [84], which assigns a score for toxicity. Highly toxic content like explicit hate speech is assigned scores closer to 1, while more casually offensive language is typically scored closer to the positive decision boundary of 0.5. Here, we consider two definitions for antisocial comments: whether the comment's score is above the standard decision boundary (0.5) or above a higher boundary (0.8) for highly toxic content. As metrics, we include both (i) the counts of comments exceeding each threshold as well as (ii) the percentage of non-toxic comments; the latter percentage allows us to model large discussions where some sub-discussions turn antisocial but the majority of content is not antisocial.

We note that toxicity itself can be challenging to measure [75]. Multiple models have been proposed for handling different aspects of antisocial behavior [31], and these models frequently have gaps in what they recognize [3, 43], can encode biases with respect to social groups [68], and can be susceptible to adversarial attacks [37]. Nevertheless, as a widely-used measure, the Perspective API provides a replicable—though imperfect—tool specifically designed for recognizing multiple forms of toxicity in online comments, which is the medium studied here.
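A minimal sketch of the two threshold-based toxicity metrics, assuming the per-comment Perspective scores have already been retrieved as floats in [0, 1]:

```python
def toxicity_metrics(scores, standard=0.5, strict=0.8):
    """scores: per-comment toxicity scores for one conversation.
    Returns (count above the standard boundary, count above the
    strict boundary, percentage of non-toxic comments)."""
    n_toxic = sum(s > standard for s in scores)
    n_severe = sum(s > strict for s in scores)
    pct_non_toxic = 1 - n_toxic / len(scores) if scores else 1.0
    return n_toxic, n_severe, pct_non_toxic
```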
This study's goal is to forecast the future behavior of a conversation from early signals. Specifically, we aim to predict whether the initial comment in a conversation signals eventual prosocial behavior. This forecasting goal mirrors analogous work on antisocial conversational trajectories [85], as a first step towards understanding prosocial evolution in conversations. We pursue this goal using data from Reddit, a large social media platform where users create posts and participate in threaded conversations relating to those posts. Critically, Reddit provides millions of conversations across different communities, known as subreddits, which span a variety of topics and interaction styles.

To analyze conversations, we extract all conversational threads under a Top-level Comment (Tlc) to a post; such comments are typically made in response to the post itself and serve as conversation starters for the rest of the community. We filter these conversations by removing those where the Tlc (i) has more than 3500 words, as manual inspection showed these were frequently spam posts, or (ii) has been deleted by the user or removed by a moderator. Additionally, Reddit includes a small number of bot accounts that interact in these conversations (e.g., replying with the number of "ooofs" a user has made); to avoid any confounding effect of these bots, we remove all threads containing a comment by a bot account.

We also acknowledge that other setups have used more of the conversation as context for forecasting, e.g., Chang and Danescu-Niculescu-Mizil [19] and Liu et al. [52]. While these too are valid setups, we focus on the initial comment intentionally to see whether emerging prosocial conversations can be quickly identified and prioritized.
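The filtering steps can be sketched as follows; the argument layout and the single bot-account name are illustrative assumptions (the paper uses a known bot list plus a manually curated one).

```python
# Hypothetical stand-in for the known-bot and curated lists.
BOT_ACCOUNTS = {"AutoModerator"}

def keep_thread(tlc_text, tlc_deleted, comment_authors):
    """Apply the three thread filters described in the text."""
    if len(tlc_text.split()) > 3500:   # overly long Tlc: frequently spam
        return False
    if tlc_deleted:                     # deleted by user or removed by mod
        return False
    if any(a in BOT_ACCOUNTS for a in comment_authors):
        return False                    # any bot participation drops thread
    return True
```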
Prosocial Metric | Description | Rate of occurrence
Information sharing | number of replies classified as informative | 0.1218
Link replies | number of links (URLs) in all replies | 0.1095
Educational link replies | number of educational links (URLs) in all replies | 0.0070
Gratitude | number of expressions of gratitude in all replies | 0.1096
Politeness | average politeness value of all replies | 1.0000
Supportiveness | average supportiveness value of all replies | 1.0000
Direct replies | number of replies responding directly to the Tlc | 1.0000
Conversation depth | length of the longest reply thread | 1.0000
Sustained conversation partners | number of distinct user pairs appearing in all replies | 0.4351
Sustained conversations | number of turns in the longest two-person conversation | 0.4351
Compliments | number of compliments | 0.0045
Laughter | number of expressions of laughter | 0.0716
Personal disclosure | number of statements an author makes about themself | 0.3302
Donations | number of links to charities and donation sites | 0.0950
% of non-toxicity (untuned) | percentage of non-toxic replies among all replies | 0.1392
% of non-toxicity (tuned) | percentage of non-extremely-toxic replies among all replies | 0.1392
Tuned toxic language | number of extremely toxic replies |

Table 1: Theory-based metrics used to measure prosocial outcomes in conversations and their correlation with human judgments of prosociality, measured with Matthews Correlation Coefficient (MCC; see §4 for details), along with their rates of occurrence in conversations. † Accommodation varied in significance across all types, with 10 having a significant MCC (shown with the mean).

 | Train | Dev | Test
Tlc | 4,290,361 | 10,844,404 | 10,844,281
Subreddits | 11,992 | 53,675 | 53,650
Table 2: Dataset sizes; note that training data has been downsampled for computational tractability.

Bot accounts are drawn from a known list of bots and a manually curated list based on inspection of high-frequency-posting accounts. The final dataset was constructed from a single year of Reddit activity and contains the ∼26M conversations summarized in Table 2.

What types of conversations do people find more prosocial? The multifaceted nature of prosociality makes numerically rating conversations challenging. Therefore, we answer this question by framing the rating as a paired-choice task: given two conversations, select which conversation contains more prosocial behavior.

This binary rating setup also allows us to directly evaluate whether each proposed prosocial metric aligns with human judgments. For each metric (§2), we measure the strength of association with human judgments by computing the Matthews Correlation Coefficient (MCC) from a 2x2 contingency table of which conversation in the pair had a higher metric score versus which was selected by annotators as being more prosocial. (While related to the χ² measure, MCC differs in that it measures the strength of association, rather than just whether the difference in the ratings is statistically significant.)

Annotated data was collected in two phases. An initial 2000 conversation pairs were collected by sampling two Tlc made to the same post using two strategies: (1) 1000 pairs of conversations made any time after the post was authored and (2) 1000 pairs made within 5 minutes of their post. After this initial selection, an additional 388 pairs of conversations were included to ensure that each metric occurred in at least 100 pairs, with the exception of the Donation metric, which only occurred in 78 pairs. Two annotators participated in three rounds of training and then divided up the annotations.
Annotators attained a Krippendorff's α of 0.78 on 300 mutually-labeled pairs, indicating high agreement.

Results.
Table 1 summarizes all of the metrics and their correlations, revealing that most of the metrics predicted from theory are significantly associated with human judgments of prosociality. Four trends merit noting. First, metrics for the breadth and depth of conversations were most correlated with prosocial judgments, with the strongest association for sustained conversation between people; these behaviors promote social cohesion and, given the discussion-focused nature of Reddit, are easily measured in conversation. Second, the next-most associated category was information-providing behaviors. While less common in Reddit conversations as a whole, these actions help other users meet their information needs. Third, surprisingly, metrics previously suggested for prosocial behavior, politeness and supportiveness, were positive but not significantly associated. We view this negative result as requiring further investigation to confirm, as more precise models for measuring these behaviors may provide more insight. Fourth, the metrics around toxicity showed that, indeed, the absence of toxicity was only moderately correlated with human judgments of prosocial conversations, while other types of behavior were more associated. Further, the presence of extremely toxic language, though rare, was not correlated, indicating that a broader picture of toxic behavior—not just extreme events—is necessary for prosocial judgment. This result also indicates that for platforms to measure their health, new metrics like those proposed above are needed, rather than simply measuring the absence of antisocial behavior. However, as our adopted measure of toxicity is a coarse-grained estimate, a potential avenue for future work is to examine whether the absence of specific forms of antisocial behavior or toxicity might individually be associated with the perception of prosocial behavior.
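The metric-validation procedure reduces to computing MCC over a 2x2 contingency table of metric winners versus annotator choices; a minimal stdlib sketch:

```python
from math import sqrt

def metric_mcc(metric_winner, human_winner):
    """Both arguments are lists of 0/1 flags indicating which
    conversation in each annotated pair won (by metric score and by
    annotator choice, respectively)."""
    pairs = list(zip(metric_winner, human_winner))
    tp = sum(m == 1 and h == 1 for m, h in pairs)
    tn = sum(m == 0 and h == 0 for m, h in pairs)
    fp = sum(m == 1 and h == 0 for m, h in pairs)
    fn = sum(m == 0 and h == 1 for m, h in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```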
Synthesizing Prosociality.
Prosocial behaviors often share a common motivation as individuals seek to engage constructively with one another. Thus, with their conceptual and thematic similarities, a single conversation may contain several of these prosocial behaviors. Given their shared motivation, we ask whether the prosocial metrics can be summarized with a single proxy metric. To test this, we adopt the approach of [76] for synthesizing a single metric from related prosocial behaviors around respect and compute a Principal Component Analysis (PCA) over all the metric scores for conversations in the training data. PCA decomposes these into a set of underlying latent behaviors, capturing the inherent correlations between metrics.

As shown in Figure 2, the first PCA component explains 57.4% of the variance in the prosocial metrics' values and positively loads on all prosocial metrics (Appendix Figure 7), suggesting the component effectively captures shared behavior. In contrast, the second principal component captures roughly 10% of the variance, with its loading not reflective of a coherent set of prosocial behaviors (Appendix Figure 8). Thus, while a simplification of the inherent complexity of online behavior, the first principal component offers a compelling single value to act as a proxy in comparing the variety of behaviors seen in conversations. In this sense, the component acts analogously to other high-level estimates of behavior such as
the toxicity score measured by Perspective API [84], which provides a single value for downstream applications.

To further test and validate the use of the first principal component as a proxy in our experiments, we calculate its MCC score with human judgments of prosociality, as was done for the individual metrics (Table 1). The resulting MCC of 0.63 is higher than the MCC for any single metric and indicates the component's value is strongly reflective of human judgments of prosociality. Thus, given the single component that explains a substantial portion of the variance and its close correlation with human judgments, we view this first component as an effective single proxy for evaluating conversations. However, we recognize that the metrics studied here, while diverse, do not capture all of prosociality, nor does the first component capture all of the variance; while strongly correlated, this first component is only an initial step at estimating prosociality.

Figure 2: The percentage of variance explained by the first five principal components across the values of prosocial metrics for conversations shows that the first PCA component explains 57.4% of the variance in the prosocial metrics' values, indicating many prosocial metrics are highly correlated. The loadings for the first and second components are shown in Appendix Figures 7 and 8.
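The synthesis step can be sketched with PCA computed via SVD; standardizing the metric columns first is an assumption here, as the paper does not specify its scaling.

```python
import numpy as np

def first_component_scores(X):
    """X: (n_conversations, n_metrics) array of metric values.
    Returns (PC1 score per conversation, PC1 explained-variance ratio)."""
    # standardize each metric column (assumed preprocessing)
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # PCA via SVD of the standardized matrix
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = Xs @ Vt[0]   # projection onto the first component
    return scores, float(explained[0])
```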
Given that the proposed metrics reflect human judgments of prosocial behavior, we introduce computational models to forecast the degree of prosociality a conversation will ultimately reach from its initial discussion. Our ultimate motivation is to train a forecasting model on conversational outcomes using massive amounts of data labeled with computationally-estimated prosociality and then, in §6, fine-tune this model to rank conversations based on human judgments.
The first principal component strongly correlates with human judgments (§4), and therefore we treat the first component's value as a single numeric estimate of the prosociality of a conversation following a Tlc. We refer to this value as the Tlc's prosocial trajectory. While using a single value to capture prosociality across all comments under a Tlc likely simplifies some nuances of different behaviors, the single value nonetheless provides a useful proxy for conversational quality akin to other antisocial metrics; further, the
PCA analysis showed that a single component captured the majority of the variance, with no other component having a consistent or substantial loading on prosociality, suggesting that a single metric, while simplifying, could still be effective at reflecting broad trends in conversational prosociality.

Models are trained to predict the prosocial trajectory value given (i) the title and text of a post, (ii) the Tlc, and (iii) metadata for the comment, including the subreddit and the time when it was posted. Models are fit and tested using the training, development, and test partitions shown in Table 2, using MSE as the objective.
Models are trained using four categories of features. The first includes features from all the prosocial metrics in §2, with the exception of accommodation; these features provide some estimate of whether the conversation is beginning on a positive note. The second category includes features reflecting the comment's relationship to the post: (i) topic distributions of the post and Tlc, (ii) cosine similarity of the two topic distributions, and (iii) Jaccard similarity of non-stop-word content in the post and Tlc. For topics, a 20-topic LDA model was trained separately for post and Tlc text using Mallet [54]. The third category includes features of the Tlc: (i) number of words, (ii) sentiment, (iii) subjectivity, (iv) number of misspelled words, (v) Flesch-Kincaid reading score, and (vi) author gender. Finally, the fourth category reflects the circumstances in which the Tlc was made: (i) the subreddit containing the Tlc, (ii) time features for the day of the month, day of the week, and hour of the day, and (iii) minutes between the post's creation time and the Tlc's creation time.
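As one concrete example from the second feature category, the Jaccard similarity of non-stop-word content can be computed as below; the tiny stop-word list is purely illustrative.

```python
# Illustrative stop-word list; a full list (e.g., from NLTK) would be
# used in practice.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def jaccard_similarity(post, tlc):
    """Jaccard similarity between the non-stop-word token sets of the
    post text and the Tlc text."""
    p = {w for w in post.lower().split() if w not in STOP_WORDS}
    t = {w for w in tlc.lower().split() if w not in STOP_WORDS}
    if not p and not t:
        return 0.0
    return len(p & t) / len(p | t)
```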
Baselines.
As a first baseline, we train a linear regression with L2 loss over all the features in §5.2, using unigram and bigram features for the Tlc text. The second baseline uses XGBoost [20], which allows us to test for combinatorial interactions between features. XGBoost was trained with a tree-based booster that had a learning rate η of 0.05, L2 regularization λ of 3.0, and L1 regularization α of 1.0. The minimum loss reduction γ required to make a further partition on a leaf node of the tree was set to 1. The maximum depth of a tree was 4. The subsample ratio of the training instances and that of columns when constructing each tree were both 0.8. We trained the model for 5000 iterations, with one parallel tree constructed during each iteration.

Our Models.
We introduce two trajectory-forecasting models built on top of the Albert model [50], which is a refinement of the BERT pretrained language model [26]. In this model, shown in Figure 3, the post and Tlc are fed into separate Albert-based networks and the [CLS] tokens from each are used as representations of their text. This vector is concatenated with a vector containing the non-textual features from §5.2 to represent the entire input. The output layer consists of a linear layer. The subreddit is represented as an embedding; these embeddings are initialized from the 300-dimensional embeddings from Kumar et al. [49] but reduced to 16 dimensions using PCA, which accounted for 72% of the variance. A total of 5278 of our 11,993 subreddits had predefined embeddings from this process, with the remaining using random initialization. To measure the effect of pre-training, we include a version of the
TLC ALBERT
TLC
Post ALBERT
Post
Subreddit
Subreddit
EmbeddingMeta Linear
Prosocial
Trajectory
Figure 3: The proposed model for predicting a conversa-tion’s prosocial trajectory. C a t e g o r y Figure 4: The MSE of prosocial forecasts within different sub-reddit categories shows that our Albert model attains higherperformance in communities whose discussion relates topop-culture such as Movies, Art, and Culture. Mean scoresfor our model and XGBoost are reported in Table 5. model using only the off-the-shelf Albert parameters that are leftunchanged and a second version that is first fine-tuned on Redditpost and Tlc text (respectively) using masked language modelingand then its parameters are updated during trajectory training. Ad-ditional details on hyperparameters and training are in AppendixC.
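The subreddit-embedding reduction described above can be sketched with a plain SVD-based PCA. The random matrix below is stand-in data; in the paper, rows are the pretrained 300-dimensional subreddit vectors of Kumar et al. [49], so the 72% explained-variance figure will not reproduce here:

```python
import numpy as np

# Hedged sketch: reduce 300-d subreddit embeddings to 16 dimensions with
# PCA via SVD, yielding initializations for the embedding layer.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5278, 300))      # stand-in for the pretrained vectors

centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 16
reduced = centered @ Vt[:k].T           # 16-d vectors, one per subreddit
explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
```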
Models were able to identify sufficient signals of the conversational trajectory from just the post and Tlc to forecast its eventual value, as shown in Table 3. While performance is low, high performance is not expected in this setting, as models only have access to the start of a conversation, which can take many potential trajectories. Nevertheless, the substantial improvement of XGBoost and our full model over both the mean-value and linear regression baselines indicates that some signals can be reliably found which would aid in proactive conversation sorting. Examining relative differences between the deep learning models, fine-tuning the language model parameters was critical to performance improvement, with the baseline Linear Regression outperforming the substantially more complicated model that used off-the-shelf parameters.
Model                       MSE    R
Mean Value (baseline)       3.010  -0.003
Linear Regression           2.393  0.209
XGBoost                     2.209  0.269
Our model (frozen Albert)   2.209  0.157
Our model                   2.230  0.262
Table 3: Mean squared error and R for forecasting prosocial conversational trajectory shows that models can estimate the trajectory value from early signals.

Conversations in some topics may be easier to forecast than others. To test for topical effects, we use the subreddit categorization from http://redditlist.com/ and compute within-category MSE. A clear trend emerges where both XGBoost and our model performed better than average for categories related to pop-culture such as Movies, Art, and Culture. The ten subreddits with the lowest MSE included r/aww, r/funny, and r/NatureIsFuckingLit, three highly popular subreddits with millions of subscribers, suggesting the model performs well in lighthearted discussions. In contrast, the ten subreddits with the highest MSE included r/changemyview, r/PoliticalDiscussion, and r/geopolitics, three popular subreddits that feature long, often-contentious discussions. We speculate that conversations in those communities are more unpredictable due to the inherent tension around the topics and therefore reflect a significant challenge to forecasting models.

Prosocial behavior may occur at any point in the subsequent conversation, which creates a challenge for our model that forecasts from only the initial comment. As a follow-up analysis, we test how our model's error changes relative to when the prosocial behavior occurs. We sampled 307K conversations of even length 𝑛 and measure the prosociality (via the first principal component) of the first 𝑛/2 comments and last 𝑛/2 comments, ordered temporally. We then stratify the conversations relative to whether the first half was more prosocial, less prosocial, or even. Figure 5 reveals that across conversation sizes (shown up to 20 comments), our model consistently has lower error for prosocial behavior that occurs soon after the initial Tlc. Across all sizes, conversations with early prosociality have lower error (mean MSE 5.96) than those with later prosociality (mean MSE 7.81). Approximately 8% of the conversations had the same estimated prosociality in each half, which suggests that future work could identify new dimensions or refine our existing measures to better discriminate between such cases.
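A minimal sketch of this half-by-half stratification follows. Summing a per-comment prosociality score is an assumption for illustration; the paper scores each half via the first principal component of its prosocial metrics:

```python
# Hedged sketch of the stratification behind Figure 5: comments of an
# even-length conversation, ordered temporally, are split into halves
# and the halves' prosociality is compared.
def stratify(scores):
    n = len(scores)
    if n % 2:
        raise ValueError("only even-length conversations are stratified")
    first, second = sum(scores[: n // 2]), sum(scores[n // 2 :])
    if first > second:
        return "first-half more prosocial"
    if first < second:
        return "second-half more prosocial"
    return "same prosociality"
```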
Figure 5: Error (MSE) in forecasting prosociality, by number of comments in the conversation, relative to when the prosocial behavior occurred within the subsequent conversation (first-half more prosocial, same prosociality, or second-half more prosocial).
We have shown that the prosocial trajectory of a conversation can be forecasted from the linguistic and social signals in its first comment. Here, we test whether these models can be used to rank conversations by their potential outcome. We adopt a simplified ranking setting where a model (or human) is shown a post and two of its Tlcs and asked to select which Tlc is likely to lead to a more prosocial conversation.
Data was drawn from the 2388 instances annotated in §4 for which conversation was more prosocial. An additional 1000 pairs were annotated where one of the Tlcs received no replies. This data was partitioned into 80% training, 10% development, and 10% test, stratifying across the three sampling strategies used to create it. Two annotators rated pairs of Tlcs to the same post according to their judgments of which was more likely to result in a positive, prosocial outcome. Annotators had a Krippendorff's 𝛼 = 0.59. As the task is inherently difficult, with no objective ground truth present in the Tlc, high agreement is not expected; however, the moderate agreement indicates that annotators were able to consistently identify a common set of lexical features they considered predictive.

[Footnote: During the previous annotation, the starts of these conversations were never shown to annotators.]
[Footnote: While empty conversations would seem to always be less prosocial, annotators preferred such conversations to those containing mostly toxic comments, though such cases were rare in practice.]

Our proposed ranking model, shown in Figure 6, fuses two of the post-and-comment forecasting models (Figure 3) with a linear layer to allow fine-tuning on held-out human judgments. To measure the effect of pre-training, we include a model that directly uses the trajectory estimates of the component models and selects the Tlc with the higher estimate; a similar model is included for XGBoost. As a baseline comparison, we include a logistic regression classifier trained on unigrams and bigrams directly on the Tlc. Finally, we include an oracle classifier that uses the actual trajectory value for each Tlc and selects the higher-valued one; this oracle-based classifier
reflects the upper bound for performance if forecasting models could perfectly estimate the trajectory and rank using only that value.

Figure 6: Diagram of our deep learning model for forecasting which conversation will be more prosocial.
Models were able to surpass human performance at correctly selecting the conversation that would result in more prosocial behavior, as shown in Table 4. Consistent with prior work on predicting antisocial behavior in early conversations, high performance is not expected in this tightly-constrained setting [39, 52, 85], as new people join conversations, each with their own interests and motivations that affect the trajectory. However, the moderate performance on this difficult task suggests that models can reliably pick up on prosocial signals from the very first comment in a discussion, which is sufficient for aiding in re-ranking newly-started conversations.

The pre-trained forecasting model's accuracy is the primary driver for performance. We can observe a soft upper bound for performance by comparing the model with the ranking prediction performance of a model that perfectly predicts the trajectory, shown as the oracle trajectory prediction. If a forecasting model were able to forecast the trajectory with perfect accuracy (like this oracle), simply picking the conversation with the higher estimated trajectory would achieve an 86.5% accuracy at selecting the conversation that ultimately had more prosocial behavior. Although exact trajectory estimates are unlikely from the Tlc alone, this result suggests that higher ranking performance is possible in future models, such as those with more data from incrementally forecasting as the conversation grows after its initial stages; indeed, models for forecasting antisocial behavior from longer context suggest that such a result is possible [19].

Despite having moderate agreement on which Tlc was predicted to have a more prosocial outcome, humans performed worse than the proposed model. Annotators and the best model had only weak agreement in their judgments (𝛼 = 0.29), and 69.7% of the annotators' correct decisions were also selected by the model.
This result, combined with the inter-annotator agreement, suggests that annotators were able to pick up on a complementary set of linguistic signals not used by the model, which future models might identify to improve performance.

Model                          Accuracy
Logistic Regression on Tlc     0.463
XGBoost                        0.540
Human Prediction
Oracle Trajectory Prediction   0.865

Table 4: Performance at predicting which of two conversations will have a more prosocial outcome shows that our best computational model outperforms human predictions.
Prosociality can take many forms, and in this paper we have developed classifiers to recognize a variety of these behaviors, showing that they can be recognized and that many are correlated with each other. However, there are multiple directions that could be taken to better reflect prosociality as a whole. First, our model is agnostic to the community itself in considering what behavior should be considered prosocial, even though communities are known to have different social norms [17, 18, 53]; for example, the jocular nature of sports and gaming communities may make politeness out of the ordinary and not in line with their desired prosocial behaviors. This direction is further supported by the observed variance in performance across different categories of subreddits (Figure 4), which suggests that directly incorporating the norms of specific communities could improve performance. Second, while our PCA score unifies many prosocial behaviors under a single metric (much like general “toxicity” scores), a significant amount of variation remains unexplained. The PCA value used in this paper offers a compelling and practical operationalization; however, further analyses are needed to identify other prototypical forms of prosociality and their effect on conversations. Third, our forecast and ranking models simplify the task to looking only at a Tlc in predicting conversational trajectory. As conversations can often take unpredictable turns, later comments are likely to influence the prosocial trajectory in ways that cannot be observed from the Tlc alone. However, given our promising results on just the Tlc, later models may improve upon these results through iteratively predicting prosocial trajectory from the growing sequence of comments in a conversation, as others have done in forecasting antisocial behavior [19, 52].
Online conversations can take many trajectories, not all of them pleasant. Improving our ability to recognize and highlight nascent prosocial conversations early on in their discussion can directly impact the daily lives and discussions of millions by fostering a more amicable and productive discourse online. This paper has introduced a series of metrics for different forms of prosocial behavior and accompanying computational techniques for recognizing them, showing that these behaviors are strongly correlated with human judgments of prosocial conversations, and that prosociality isn't simply explained by the absence of antisocial behavior. In two experiments, we introduce a series of deep learning models showing that prosocial trajectories can be forecasted from just the initial comment in a conversation. Then, we show that these models can be adapted to predict which of two conversations is more likely to have a prosocial outcome from these signals, providing a ranking mechanism for increasing the visibility of conversations likely to have prosocial outcomes. While forecasting from such little data is difficult, it is critical to ranking new conversations, and our model is able to substantially improve on human performance (a relative 24% improvement) at selecting the conversation with the better outcome. Our model provides an initial starting point for conversation ranking, and we show that if the forecast were completely accurate, such models would have an upper limit of 86.5% accuracy, further motivating work in this area. Code, data, and models are available at https://github.com/davidjurgens/prosocial-conversation-forecasting.
ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grants No. 1850221 and 2007251.
REFERENCES [1] George A Akerlof and Rachel E Kranton. 2000. Economics and identity.
The Quarterly Journal of Economics
Social and Personality Psychology Compass
6, 6 (2012), 455–469.[3] Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate speech detection isnot as easy as you may think: A closer look at model validation. In
Proceedingsof SIGIR . 45–54.[4] Jo-Anne Bachorowski and Michael J Owren. 2001. Not all laughs are alike: Voicedbut not unvoiced laughter readily elicits positive affect.
Psychological Science
Proceedings of EMNLP .1986–1996.[6] Daniel Bar-Tal. 1976.
Prosocial behavior: Theory and research.
HemispherePublishing Corp.[7] Azy Barak and Orit Gluck-Ofri. 2007. Degree and reciprocity of self-disclosurein online forums.
CyberPsychology & Behavior
10, 3 (2007), 407–417.[8] Monica Y Bartlett and David DeSteno. 2006. Gratitude and prosocial behavior:Helping when it costs you.
Psychological science
17, 4 (2006), 319–325.[9] C Daniel Batson and Adam A Powell. 2003. Altruism and prosocial behavior.
Handbook of psychology (2003), 463–484.[10] Roger Berry. 2009. You could say that: the generic second-person pronoun inmodern English.
English Today
25, 3 (2009), 29–34.[11] Prakhar Biyani, Cornelia Caragea, Prasenjit Mitra, and John Yen. 2014. Identifyingemotional and informational support in online health communities. In
Proceedingsof COLING .[12] Jonathon D Brown and S Smart. 1991. The self and social conduct: Linking self-representations to prosocial behavior.
Journal of Personality and Social psychology
60, 3 (1991), 368.[13] Penelope Brown. 2015. Politeness and language. In
The International Encyclopediaof the Social and Behavioural Sciences (IESBS),(2nd ed.) . Elsevier, 326–330.[14] Taina Bucher. 2012. Want to be on the top? Algorithmic power and the threat ofinvisibility on Facebook.
New media & society
14, 7 (2012), 1164–1180.[15] Sven Buechel, Anneke Buffone, Barry Slaff, Lyle Ungar, and Joao Sedoc. 2018.Modeling empathy and distress in reaction to news stories. In
Proceedings ofIJCNLP .[16] Anthony L Burrow and Nicolette Rainone. 2017. How many likes did I get?:Purpose moderates links between positive social media feedback and self-esteem.
Journal of Experimental Social Psychology
69 (2017), 232–236.[17] Stevie Chancellor, Andrea Hu, and Munmun De Choudhury. 2018. Norms mat-ter: contrasting social support around behavior change in online weight losscommunities. In
Proceedings of CHI . 1–14.[18] Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, AmyBruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet’sHidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso,and Macro Scales.
Proceedings of CSCW (2018).[19] Jonathan P Chang and Cristian Danescu-Niculescu-Mizil. 2019. Trouble on thehorizon: Forecasting the derailment of online conversations as they develop. In
Proceedings of EMNLP .[20] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system.In
Proceedings of KDD . [21] Justin Cheng, Michael Bernstein, Cristian Danescu-Niculescu-Mizil, and JureLeskovec. 2017. Anyone can become a troll: Causes of trolling behavior in onlinediscussions. In
Proceedings of CSCW . ACM, 1217–1230.[22] Minje Choi, Luca Maria Aiello, Krisztián Zsolt Varga, and Daniele Quercia. 2020.Ten social dimensions of conversations and relationships. In
Proceedings of TheWeb Conference . 1514–1525.[23] Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012.Echoes of power: Language effects and power differences in social interaction.In
Proceedings of WWW . ACM, 699–708.[24] Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec,and Christopher Potts. 2013. A computational approach to politeness withapplication to social factors. In
Proceedings of ACL .[25] Sebastian Deri, Jeremie Rappaz, Luca Maria Aiello, and Daniele Quercia. 2018.Coloring in the links: Capturing social ties as they are perceived.
Proceedings ofCSCW (2018).[26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert:Pre-training of deep bidirectional transformers for language understanding. In
Proceedings of NAACL .[27] Steve Duck. 2007.
Human relationships . Sage.[28] Maeve Duggan. 2017. Online harassment 2017. (2017).[29] Robert A Emmons and Michael E McCullough. 2004.
The psychology of gratitude .Oxford University Press.[30] Facebook Research. 2019. Instagram Request for Proposals for Well-being andSafety Research. https://research.fb.com/programs/research-awards/proposals/instagram-request-for-proposals-for-well-being-and-safety-research/.[31] Paula Fortuna and Sérgio Nunes. 2018. A survey on automatic detection of hatespeech in text.
ACM Computing Surveys (CSUR)
51, 4 (2018), 85.[32] Jeremy A Frimer, Karl Aquino, Jochen E Gebauer, Luke Lei Zhu, and HarrisonOakes. 2015. A decline in prosocial language helps explain public disapproval ofthe US Congress.
Proceedings of the National Academy of Sciences
Journal of personality and social psychology
Custodians of the Internet: Platforms, content moderation,and the hidden decisions that shape social media . Yale University Press.[35] Lorenz Goette, David Huffman, and Stephan Meier. 2012. The impact of socialties on group interactions: Evidence from minimal groups and randomly assignedreal groups.
American Economic Journal: Microeconomics
4, 1 (2012), 101–15.[36] David Greatbatch and Timothy Clark. 2003. Displaying group cohesiveness:Humour and laughter in the public lectures of management gurus.
Humanrelations
56, 12 (2003), 1515–1544.[37] Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N Asokan. 2018. AllYou Need is" Love" Evading Hate Speech Detection. In
Proceedings of the 11thACM Workshop on Artificial Intelligence and Security . 2–12.[38] Clayton J Hutto and Eric Gilbert. 2014. Vader: A parsimonious rule-based modelfor sentiment analysis of social media text. In
Proceedings of ICWSM .[39] Yunhao Jiao, Cheng Li, Fei Wu, and Qiaozhu Mei. 2018. Find the conversationkillers: A predictive study of thread-ending posts. In
Proceedings of the WebConference .[40] Adam N Joinson. 2001. Self-disclosure in computer-mediated communication:The role of self-awareness and visual anonymity.
European journal of socialpsychology
31, 2 (2001), 177–192.[41] Adam N Joinson and Carina B Paine. 2007. Self-disclosure, privacy and theInternet.
The Oxford handbook of Internet psychology
Self-dislosure: An Experimental Analysis of the TransparentSelf . Wiley Interscience.[43] David Jurgens, Eshwar Chandrasekharan, and Libby Hemphill. 2019. A Just andComprehensive Strategy for Using NLP to Address Online Abuse. In
Proceedingsof the 57th Annual Meeting of the Association for Computational Linguistics (ACL) .[44] Roberta L Knickerbocker. 2003. Prosocial behavior.
Center on Philanthropy atIndiana University (2003), 1–3.[45] Varada Kolhatkar and Maite Taboada. 2017. Constructive language in newscomments. In
Proceedings of the Workshop on Abusive Language Online .[46] Varada Kolhatkar, Nithum Thain, Jeffrey Sorensen, Lucas Dixon, and MaiteTaboada. 2020. Classifying Constructive Comments.
First Monday. (2020).[47] Wojciech Kulesza, Dariusz Dolinski, Avia Huisman, and Robert Majewski. 2014.The echo effect: The power of verbal mimicry to influence prosocial behavior.
Journal of Language and Social Psychology
33, 2 (2014), 183–201.[48] Srijan Kumar, Justin Cheng, and Jure Leskovec. 2017. Antisocial behavior onthe web: Characterization and detection. In
Proceedings of the 26th InternationalConference on World Wide Web Companion . 947–950.[49] Srijan Kumar, Xikun Zhang, and Jure Leskovec. 2019. Predicting dynamic em-bedding trajectory in temporal interaction networks. In
Proceedings of KDD. ACM, 1269–1278. [50] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
[51] David Lazer. 2015. The rise of the social algorithm.
Science
Proceedings of ICWSM .[53] J Nathan Matias. 2019. Preventing harassment and increasing group participationthrough social norms in 2,190 online science discussions.
Proceedings of theNational Academy of Sciences
Psychological bulletin
Current directions in psychological science
17, 4 (2008), 281–285.[57] David N Milne, Glen Pink, Ben Hachey, and Rafael A Calvo. 2016. Clpsych 2016shared task: Triaging content in online peer-support forums. In
Proceedings of theThird Workshop on Computational Linguistics and Clinical Psychology . 118–127.[58] Paul Mussen and Nancy Eisenberg-Berg. 1977.
Roots of caring, sharing, andhelping: The development of pro-social behavior in children.
WH Freeman.[59] Courtney Napoles, Aasish Pappu, and Joel Tetreault. 2017. Automatically identi-fying good conversations online (yes, they do exist!). In
Proceedings of ICWSM .[60] Courtney Napoles, Joel Tetreault, Aasish Pappu, Enrica Rosato, and Brian Proven-zale. 2017. Finding good conversations online: The Yahoo News annotatedcomments corpus. In
Proceedings of the 11th Linguistic Annotation Workshop .[61] Amit Navindgi, Caroline Brun, Cécile Boulard Masson, and Scott Nowson. 2016.Steps toward automatic understanding of the function of affective language insupport groups. In
Proceedings of The Fourth International Workshop on NaturalLanguage Processing for Social Media . 26–33.[62] Kate G Niederhoffer and James W Pennebaker. 2002. Linguistic style matchingin social interaction.
Journal of Language and Social Psychology
21, 4 (2002),337–360.[63] Ariana Orvell, Ethan Kross, and Susan A Gelman. 2017. How “you” makesmeaning.
Science
Journal of NonverbalBehavior
27, 3 (2003), 183–200.[65] Mitchell J Prinstein and Antonius HN Cillessen. 2003. Forms and functions ofadolescent peer aggression associated with high levels of peer status.
Merrill-Palmer Quarterly (1982-) (2003), 310–342.[66] Sarah T Roberts. 2014.
Behind the screen: The hidden digital labor of commercial con-tent moderation . Ph.D. Dissertation. University of Illinois at Urbana-Champaign.[67] Koustuv Saha, Eshwar Chandrasekharan, and Munmun De Choudhury. 2019.Prevalence and Psychological Effects of Hateful Speech in Online College Com-munities.. In
Proceedings of Web Science .[68] Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019.The risk of racial bias in hate speech detection. In
Proceedings of ACL .[69] Lauren E Scissors, Alastair J Gill, and Darren Gergle. 2008. Linguistic mimicryand trust in text-based CMC. In
Proceedings of CSCW .[70] Ashish Sharma, Adam S Miner, David C Atkins, and Tim Althoff. 2020. A Com-putational Approach to Understanding Empathy Expressed in Text-Based MentalHealth Support. In
Proceedings of EMNLP .[71] Lee Sproull. 2011. Prosocial behavior on the net.
Daedalus
Negotiation and Conflict Management Research
1, 3 (2008), 263–281.[73] Jean M Twenge, Roy F Baumeister, C Nathan DeWall, Natalie J Ciarocco, andJ Michael Bartels. 2007. Social exclusion decreases prosocial behavior.
Journal ofpersonality and social psychology
92, 1 (2007), 56.[74] Twitter. 2018. Twitter health metrics proposal submission. https://blog.twitter.com/en_us/topics/company/2018/twitter-health-metrics-proposal-submission.html.[75] Bertie Vidgen, Alex Harris, Dong Nguyen, Rebekah Tromble, Scott Hale, andHelen Margetts. 2019. Challenges and frontiers in abusive content detection. In
ACL .[76] Rob Voigt, Nicholas P Camp, Vinodkumar Prabhakaran, William L Hamilton,Rebecca C Hetey, Camilla M Griffiths, David Jurgens, Dan Jurafsky, and Jen-nifer L Eberhardt. 2017. Language from police body camera footage shows racialdisparities in officer respect.
Proceedings of the National Academy of Sciences
Proceedings of the 6th Workshopon Cognitive Modeling and Computational Linguistics . 9–18.[78] Zijian Wang and David Jurgens. 2018. It’s going to be okay: Measuring Access toSupport in Online Communities. In
Proceedings of EMNLP .[79] Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017.Understanding abuse: A typology of abusive language detection subtasks. In
Proceedings of the First Workshop on Abusive Language .[80] Thomas Ashby Wills. 1991. Social support and interpersonal relationships. (1991). [81] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, JoeDavison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu,Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest,and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural LanguageProcessing. In
Proceedings of EMNLP .[82] Michelle F Wright and Yan Li. 2011. The associations between young adults’face-to-face prosocial behaviors and their online prosocial behaviors.
Computersin Human Behavior
27, 5 (2011), 1959–1962.[83] Michelle F Wright and Yan Li. 2012. Prosocial behaviors in the cyber context. In
Encyclopedia of cyber behavior . IGI Global, 328–341.[84] Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personalattacks seen at scale. In
Proceedings of the Web Conference .[85] Justine Zhang, Jonathan P Chang, Cristian Danescu-Niculescu-Mizil, Lucas Dixon,Yiqing Hua, Nithum Thain, and Dario Taraborelli. 2018. Conversations goneawry: Detecting early signs of conversational failure. In
Proceedings of ACL .[86] Naitian Zhou and David Jurgens. 2020. Condolences and Empathy in OnlineCommunities. In
Proceedings of EMNLP.

A PROSOCIAL METRICS
This section describes the features, training, and setup for classifiers and regressors that estimate specific prosocial metrics.
A.1 Information Sharing
Information-sharing comments were identified using a classifier trained on heuristically-labeled data. Positive examples of information sharing were drawn from 18 question-focused subreddits where individuals post questions and receive potentially-informative replies (e.g., r/whatisthisthing and r/AskReddit); these subreddits cover multiple topics to prevent overfitting to sharing just one type of information. Information-sharing comments were drawn from January–March of 2018 from posts that contained at least one question; replies to these questions receiving a score above a minimum threshold were kept as positive examples, while negative examples were drawn from comments not in these subreddits. Our dataset consists of 55,542 informative comments and 300,226 comments from non-informative communities. This class skew was intentionally left imbalanced to simulate the real-life scenario where most comments are not information-sharing. We fit a logistic regression classifier on unigram and bigram features with five-fold cross-validation, varying the hyperparameter for the minimum n-gram frequency. The final model obtained an F1-score of 0.713, and we selected a decision threshold on the predicted probability accordingly.
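A minimal sketch of this classifier with scikit-learn follows; the toy corpus, labels, and min_df value are illustrative stand-ins, not the Reddit data described above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for informative (1) vs. non-informative (0) comments.
texts = [
    "it is a 1995 Fender Stratocaster, you can tell by the headstock",
    "the answer is in the FAQ, see the sidebar link",
    "lol that is hilarious",
    "nice picture, upvoted",
]
labels = [1, 1, 0, 0]

# Unigram+bigram counts with a minimum n-gram frequency cutoff (min_df),
# fed to logistic regression, as in A.1.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
```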
Laughter is detected by identifying colloquial internet expressions signalling laughter, e.g., “haha” or “lol.” As these forms may repeat or have variation, e.g., “hahhahaaha,” we use a regex to detect them:

\ba*h+a+h+a+(h+a+)*?h*\b|\bl+o+l+(o+l+)*?\b|\bh+e+h+e+(h+e+)*?h*\b
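The regex above can be applied as follows; lowercasing the comment first is an assumption, as the paper does not state how case is handled:

```python
import re

# The laughter regex from above, verbatim, split across its three
# alternatives for readability.
LAUGHTER = re.compile(
    r"\ba*h+a+h+a+(h+a+)*?h*\b"    # haha, ahahah, hahahah, ...
    r"|\bl+o+l+(o+l+)*?\b"         # lol, looool, lolol, ...
    r"|\bh+e+h+e+(h+e+)*?h*\b"     # hehe, heheh, ...
)

def has_laughter(comment):
    # Lowercasing first is an assumption, not stated in the paper.
    return LAUGHTER.search(comment.lower()) is not None
```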
A.3 Mentoring
We built a classifier with positive examples of mentoring drawn from advice-based subreddits where users post questions and the community responds with advice to those questions (appearing as Tlcs). Negative examples were randomly drawn from Tlcs made in all other subreddits. We considered communities containing the word “Advice” in their name (e.g., r/legaladvice, r/relationship_advice, and r/mechanicadvice), excluding those with the word “Bad.” Compared to information-sharing subreddits, answers in mentoring subreddits are typically subjective in nature. We generated 500,000 negative examples using reservoir sampling. We processed and purified this dataset with the same pipeline as A.1, which resulted in 79,430 positive examples of mentoring and 299,006 negative examples. Our logistic regression classifier for mentoring prediction has an F1-score of 0.762, and we manually adjusted the decision threshold on the predicted probability.
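Reservoir sampling, as used here to draw the negative examples, can be sketched as a single pass over a comment stream of unknown length, keeping a uniform random sample of fixed size:

```python
import random

# Hedged sketch of reservoir sampling: each stream item ends up in the
# reservoir with equal probability k/N without knowing N in advance.
def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randint(0, i)      # keep item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```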
To detect gratitude in replies, we use a fixed lexicon of words and phrases, which manual inspection showed had high precision when interpreting whether the responding user was expressing gratitude. Gratitude words are “thanks,” “contented,” and “blessed.” Gratitude phrases are “thank you,” “thankful for,” “grateful for,” “greatful for,” “my gratitude,” “i appreciate,” “make me smile,” “i super appreciate,” “i deeply appreciate,” “i really appreciate,” and “bless your soul.”
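A minimal sketch of this lexicon matcher follows; case-insensitive matching and the regex tokenization are illustrative assumptions:

```python
import re

# Hedged sketch of the gratitude detector, using the fixed lexicon above.
GRATITUDE_WORDS = {"thanks", "contented", "blessed"}
GRATITUDE_PHRASES = [
    "thank you", "thankful for", "grateful for", "greatful for",
    "my gratitude", "i appreciate", "make me smile", "i super appreciate",
    "i deeply appreciate", "i really appreciate", "bless your soul",
]

def expresses_gratitude(reply):
    text = reply.lower()
    words = set(re.findall(r"[a-z]+", text))
    return bool(words & GRATITUDE_WORDS) or any(p in text for p in GRATITUDE_PHRASES)
```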
A.5 Esteem Enhancement
Compliments were identified using a rule-based procedure to select parts of comments referring to the user being replied to and then testing whether the sentiment around that reference was positive. An initial set of candidates was identified by looking for direct mentions of “you is/are” or “your [word] is/are.” We then filter out all candidates where “you” is immediately preceded by “if” or “when,” as analysis showed these constructions were likely to invoke the generic sense of you [10, 63] and not refer directly to the user in the parent comment. From the remaining candidates, we extract the five words following our matched phrase and score the sentiment using VADER [38]. We use the compound sentiment score from the vaderSentiment library as an aggregate estimate of the positivity toward the parent comment's user. The minimum threshold for sentiment was set at 0.7 after a review of several hundred sentiment scores showed this resulted in few false positives.
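The candidate-extraction step can be sketched as below; the regexes approximate the patterns described above, and the extracted window would then be scored with VADER's compound sentiment (threshold 0.7), which is omitted here:

```python
import re

# Hedged sketch of compliment-candidate extraction. The patterns and the
# generic-"you" filter follow the description above; sentiment scoring
# of the extracted window is left out.
CANDIDATE = re.compile(r"\byou (?:is|are)\b|\byour \w+ (?:is|are)\b", re.I)
GENERIC_YOU = re.compile(r"\b(?:if|when) you\b", re.I)

def compliment_candidates(comment):
    spans = []
    for m in CANDIDATE.finditer(comment):
        # Skip generic uses of "you" immediately preceded by "if"/"when".
        if GENERIC_YOU.search(comment[max(0, m.start() - 5):m.end()]):
            continue
        # Extract the five words following the matched phrase.
        spans.append(" ".join(comment[m.end():].split()[:5]))
    return spans
```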
A.6 Donations
Fundraising and donation behavior is measured by counting how many times a URL with one of these base domains is mentioned in the total conversation. The following domains were drawn from popular charity, fundraising, and donation organizations: gofundme.com, indiegogo.com, causes.com, kickstarter.com, patreon.com, circleup.com, lendingclub.com, fundly.com, donatekindly.org, givecampus.com, snap-raise.com, snowballfundraising.com, bonfire.com, crowdrise.com, dojiggy.com, mightycause.com, depositagift.com, wemakeit.com, donorschoose.org, fundrazr.com, rallyme.com, startsomegood.com, diabetes.org, humanesociety.org, cancer.org, nwf.org, worldwildlife.org, habitat.org, oxfam.org, unicefusa.org, wish.org, nature.org, aspca.org, savethechildren.org, wfp.org, hrc.org, hrw.org, nationalmssociety.org, redcross.org, mentalhealthamerica.net, amnesty.org, heart.org, crs.org, kiva.org, fsf.org, rotary.org, alz.org, doctorswithoutborders.org, unitedway.org, and cancer.org.

Figure 7: PCA Component 0 Loadings Across Metrics.

Figure 8: PCA Component 1 Loadings Across Metrics.
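The donation metric described in A.6 amounts to counting base-domain mentions in the concatenated conversation text. A minimal sketch (our illustration, not the paper's code; the domain list is abbreviated from the full list given above):

```python
import re

# Abbreviated for illustration; the full list of charity/fundraising
# base domains appears in Appendix A.6.
DONATION_DOMAINS = [
    "gofundme.com", "kickstarter.com", "patreon.com",
    "donorschoose.org", "redcross.org",
]

DOMAIN_PATTERN = re.compile(
    "|".join(re.escape(d) for d in DONATION_DOMAINS), re.IGNORECASE
)

def count_donation_links(conversation_text):
    """Count mentions of known charity/fundraising base domains in the
    concatenated text of a conversation."""
    return len(DOMAIN_PATTERN.findall(conversation_text))

print(count_donation_links(
    "Help out at https://www.gofundme.com/f/example or patreon.com/example"
))
```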
A.7 Politeness
Two prior datasets exist with politeness ratings. The data of Danescu-Niculescu-Mizil et al. [24] contains z-scored ratings of politeness for questions, whereas the data of Wang and Jurgens [78] contains ratings for statements of a variety of lengths, rated on a scale of [1,5] where 3 indicates neither polite nor impolite. To build a robust classifier for Reddit, we combine both datasets and rescale them to be within [-1,1]. To obtain the politeness regressor, we first pre-train a BERT-based model [26] on Reddit data using masked language modeling. Then, we fine-tune those parameters using the Adam optimizer with a learning rate of 0.00002. The max sequence length is set to 128. While training, we adopt MSE as the loss function and use five-fold cross-validation to evaluate model performance. Each model was run for at most 5 epochs, and we kept the one whose average Pearson r across the five folds was highest. The final model obtained r = 0.66 with human judgments from both datasets.

Category    Our Model  XGBoost    Category   Our Model  XGBoost
Art         1.88       1.75       Pictures   2.05       1.93
Culture     2.14       2.01       Music      2.00       2.05
TV          2.16       2.09       Lifestyle  2.16       2.19
Sports      2.27       2.19       Movies     2.50       2.29
Gaming      2.32       2.30       Humor      2.54       2.39
Technology  2.55       2.50       Meta       2.77       2.59
Discussion  2.70       2.62       Location   2.70       2.63
Info        2.85       2.79       Science    3.07       2.97

Table 5: The MSE of prosocial forecasts within different subreddit categories shows that our two top models both attain higher performance in communities whose discussion relates to pop-culture such as Movies, Art, and Culture.

A.8 Supportiveness
The supportiveness regressor was trained in a similar manner as the politeness regressor, but used the only available dataset, that of Wang and Jurgens [78], for estimating support. Support is scored within [-1,1], with a rating of 0 indicating neither supportive nor unsupportive. A BERT model is first pre-trained on Reddit data using masked language modeling, and then five-fold cross-validation is performed where each fold is fine-tuned on these support ratings. We select the model with the best performance across folds. The final model reached r = 0.58 with human judgments on their data, which surpasses the state-of-the-art results reported in Wang and Jurgens [78] for their best model.

B ADDITIONAL PCA ANALYSIS
Multiple prosocial behaviors may occur in the same conversation, and to capture their regular co-occurrence, we use Principal Component Analysis (PCA) to identify the main forms of variation. PCA is computed on a matrix where each conversation is a row and the columns contain the value of each prosocial metric. As shown in Figure 2 (main paper), the first principal component explains 57.4% of the variance in the data, with all other components explaining far less. The loadings of this first principal component (Figure 7) show that it loads on all of the prosocial behaviors (and negatively on the antisocial behaviors), indicating that it effectively summarizes our studied prosocial behaviors within a single metric. As a comparison, we show loadings for the second-largest component in Figure 8, which explains ~10% of the variance; this component does not have any clear association with prosocial behavior and seems to match conversations with high scores but little conversation. Similar trends were observed for all other components, which lacked a clear association with prosocial behavior, suggesting that a single metric can be a reasonable proxy for summarizing the prosocial behaviors.
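As a toy illustration of this setup (our sketch, not the paper's code; the metric values below are invented), PCA on a conversation-by-metric matrix reduces, for two metrics, to the eigendecomposition of a 2x2 covariance matrix, from which the share of variance explained by the first component follows in closed form:

```python
import math

def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pc1_explained_variance(metric_a, metric_b):
    """Fraction of total variance captured by the first principal component
    of a two-column conversation-by-metric matrix."""
    a = covariance(metric_a, metric_a)   # var(A)
    c = covariance(metric_b, metric_b)   # var(B)
    b = covariance(metric_a, metric_b)   # cov(A, B)
    # Closed-form eigenvalues of the 2x2 covariance matrix [[a, b], [b, c]]
    mean = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mean + half_gap, mean - half_gap
    return lam1 / (lam1 + lam2)

# Two hypothetical, strongly co-occurring metrics per conversation
gratitude = [0, 1, 2, 3, 4, 5]
compliments = [0, 2, 3, 5, 8, 10]
ratio = pc1_explained_variance(gratitude, compliments)
print(f"PC0 explains {ratio:.1%} of the variance")
```

When metrics co-occur strongly, as the prosocial metrics do here, the first component captures nearly all the variance, which is why a single PC0 score can serve as a summary metric.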
Hyperparameter     Value
booster type       gbtree
learning rate η
γ
λ
α

Table 6: Hyperparameters of the XGBoost model.
Hyperparameter                            Our Model       Frozen Albert
learning rate                             1e-5            1e-4
number of epochs                          2               5
L2 penalty (c)                            1e-6            1e-6
pretrained albert type (for Tlc texts)    albert-base-v2  albert-base-v2
pretrained albert type (for post texts)   albert-base-v2  albert-base-v2
dropout probability                       0.5             0.5
subreddit embedding dimension             16              16
learning rate scheduling                  linear          linear
optimizer                                 AdamW           AdamW
random seed                               42              42

Table 7: Hyperparameters of our model (left) and the model using frozen weights for the base Albert model (right).
C MODEL HYPERPARAMETERS
XGBoost.
Hyperparameters for the XGBoost model are shown in Table 6. We tuned the learning rate η through grid search (log-linearly) in the range between 0.001 and 0.1, selecting the model with the lowest mean squared error as the best model. The model was trained on CPUs for 27 hours, 21 minutes, and 3 seconds, validated on the validation set every 10 iterations. The mean squared error of this model on the validation set is 1.49, and its R is 0.27.

Our Albert-based Model.
Hyperparameters for our Albert-based model are shown in Table 7. We tuned the learning rate and weight decay using grid search (log-linearly) in the ranges from 0.1 to 0.001 and from 1e-4 to 1e-7, respectively, selecting the model with the lowest mean squared error as the best model. The model was trained on a single GeForce GTX 2080 Ti for 45 hours, 53 minutes, and 14 seconds, validated on a randomly-sampled validation set every 3351 iterations (40 times per epoch). The mean squared error of this model on the validation set is 2.24, and its R is 0.27. For the model using frozen Albert parameters from the Hugging Face transformers library [81], hyperparameters are also shown in Table 7. This model has the same architecture, differing only in fine-tuning and which weights are frozen. It was trained on a single GeForce GTX 2080 Ti for 55 hours, 34 minutes, and 9 seconds, validated on a randomly-sampled validation set every 482 iterations (40 times per epoch). The mean squared error of this model on the validation set is 2.50, and its R2