RECAST: Enabling User Recourse and Interpretability of Toxicity Detection Models with Interactive Visualization
Austin P Wright, Omar Shaikh, Haekyu Park, Will Epperson, Muhammed Ahmed, Stephane Pinel, Duen Horng Chau, Diyi Yang
AUSTIN P WRIGHT,
Georgia Institute of Technology, USA
OMAR SHAIKH,
Georgia Institute of Technology, USA
HAEKYU PARK,
Georgia Institute of Technology, USA
WILL EPPERSON,
Georgia Institute of Technology, USA
MUHAMMED AHMED,
Mailchimp, USA
STEPHANE PINEL,
Mailchimp, USA
DUEN HORNG (POLO) CHAU,
Georgia Institute of Technology, USA
DIYI YANG,
Georgia Institute of Technology, USA
Fig. 1. The Recast user interface. A. Toxicity score of overall input text shows edits’ effect on toxicity in real time. B. Words whose possible alternatives have strong potential for toxicity reduction are highlighted in yellow. C. Usage guide for Recast’s capabilities. D. Underline opacity visualizes the model’s attention on words, including those without alternatives, to inform users about which words contribute important context (e.g., “kid” is underlined, because toxicity towards a kid contributes to the toxic context). E. Showing the toxicity score of selected text allows users to localize the sources of toxicity and search for the regions most important to edit. F. Hovering over highlighted toxic text displays alternative wording in a pop-up.
With the widespread use of toxic language online, platforms are increasingly using automated systems that leverage advances in natural language processing to automatically flag and remove toxic comments. However, most automated systems—when detecting and moderating toxic language—do not provide feedback to their users, let alone provide an avenue of recourse for these users to make actionable changes. We present our work, Recast, an interactive, open-sourced web tool for visualizing these models’ toxic predictions, while providing alternative suggestions for flagged toxic language. Our work also provides users with a new path of recourse when using these automated moderation tools. Recast highlights text responsible for classifying toxicity, and allows users to interactively substitute potentially toxic phrases with neutral alternatives. We examined the effect of Recast via two large-scale user evaluations, and found that Recast was highly effective at helping users reduce toxicity as detected through the model. Users also gained a stronger understanding of the underlying toxicity criterion used by black-box models, enabling transparency and recourse. In addition, we found that when users focus on optimizing language for these models instead of their own judgement (which is the implied incentive and goal of deploying automated models), these models cease to be effective classifiers of toxicity compared to human annotations. This opens a discussion for how toxicity detection models work and should work, and their effect on the future of online discourse.

CCS Concepts: • Human-centered computing → Collaborative and social computing systems and tools.

Additional Key Words and Phrases: content moderation, toxicity detection, interactive visualization, natural language processing, intervention
ACM Reference Format:
Austin P Wright, Omar Shaikh, Haekyu Park, Will Epperson, Muhammed Ahmed, Stephane Pinel, Duen Horng (Polo) Chau, and Diyi Yang. 2021. Recast: Enabling User Recourse and Interpretability of Toxicity Detection Models with Interactive Visualization. Proc. ACM Hum.-Comput. Interact. 5, CSCW1, Article 181 (April 2021), 26 pages. https://doi.org/10.1145/3449280
WARNING: This paper contains some content which is offensive in nature.
Toxicity online is widespread: a 2015 user survey on the online social network platform Reddit found that 50% of negative responses were attributed to hateful or offensive content [2]; however, addressing toxicity through automated means is not trivial, as there must always be a choice of determining what language should be removed and what should not. The same survey also found that 35% of complaints from extremely dissatisfied users were about heavy-handed moderation and censorship. With the inherent trade-offs baked into content moderation, it is challenging to find a middle ground for this issue. Furthermore, the extreme scale of social media interactions [46] exacerbates these challenges. These issues have led to the development of automatic toxicity detection models such as the Google Perspective API [30].

Introducing automation, however, raises its own challenges. Machine learning models, responsible for detecting and moderating toxic language, can themselves be flawed [18, 40]. Where in the past users could rely on clear community standards from human moderation (or at the least an ability to communicate with a moderator), the adoption of fully automated systems makes human-facilitated moderation much more difficult. Moderators who do not understand how automated tools work may not be able to contribute as much after these tools are adopted. As platforms rely more heavily on automated systems for moderation, users also receive less feedback and might not be able to clearly connect their behaviors with community standards [22]. This fundamentally reduces the
effectiveness of automated moderation, as users cannot learn what they did wrong, especially if they are unfamiliar with the language or social norms of a platform.

Without feedback or explanation, users of online forums that use toxicity detection systems based on black-box NLP models might question how their language is being examined. In such scenarios, there is no way to interpret why the model considers language toxic. Without tools providing an avenue of recourse [56] designed for actual end users to audit what is being detected and make actionable changes to their language, people are disempowered to participate in discourse online. Furthermore, without the ability to detect when a model is falsely flagging language due to either linguistic limitations or social biases, the work of finding inaccuracies and correcting models is left entirely to the unrepresentative population of machine learning researchers and software engineers. Therefore, providing end users the ability to interactively audit the models that affect them will help democratize the improvement of these models.
Finally, interactive auditing opens an avenue for recourse. Given an explanation for how a model works, users can re-evaluate writing toxic text, increase awareness of potential limitations in toxicity detection models, and inform people who are unaware of certain toxic jargon. Black-box models, however, are difficult to interrogate: they provide end users with little capacity to observe their underlying decision-making processes. Highlighting features that contribute to a model’s output provides users with concrete evidence when pursuing recourse.

We address these challenges by developing an interactive tool called
Recast, which allows for the interrogation of toxicity detection models through counterfactual alternative wording and attention visualization. Recast’s design does not require any expertise in machine learning from users, and enables them to visualize their language through the eyes of the algorithm. To this end, the primary focus of Recast is allowing users to visualize where and how a model detects toxicity within a specific piece of text, make actionable changes to their language to reduce toxicity as determined by the model, and gain generalizable insights into how the model works to inform future language and spur potential changes if flaws are found. Furthermore, we study the effects of allowing users to experiment with Recast, analyzing how their language changes as they are more aware of the model. To sum up, our contributions are:

(1) Recast, an interactive system allowing users to dynamically interrogate the classification of toxicity by visualizing its sources within a piece of text and examining alternative wordings. This provides users a method to understand and interact with toxicity detection models.

(2) Experimental Findings. By using Recast as a means for users to understand models and optimize language more efficiently, we not only evaluate the effectiveness of Recast, but also study potential long term effects on discourse caused by structural incentives from automated toxicity filters. We find that as users learn to optimize their language (a necessity to avoid being censored as these models become more prevalent), human labeled toxicity can increase compared to naïve editing. Thus, this work provides the first large scale quantitative evaluation of toxicity detection models and their potential effects on online discourse.

(3) Open Source Implementation of Recast that enables broad access and future work. Recast provides a model agnostic framework for analysis of toxicity detection models by not only model developers, but domain users and human moderators. Recast can be used so that all relevant parties can become better aware of emerging issues in these models. We have uploaded our source code as supplementary material. We will immediately make the code public on GitHub upon publication of this work.
There has been extensive research on human-managed online content moderation [15, 42], especially relating to the effect of transparency from the perspective of the end user [21, 24] and the importance of explanations [23]. Various works have called for the development of tools that support, rather than supplant, the efforts of human moderators [23, 41, 42]. Furthermore, several works have focused on studying the development of new automated moderation systems [6, 31], and their effect on the dynamics of the platforms that utilize them [14]. However, most of this research has focused on rule based systems, such as the Reddit Auto-Moderator [22]. In contrast, work that has used statistical machine learning approaches, possibly with the exception of Chandrasekharan et al. [6], has often focused on accuracy over transparency or moderator experience [3]. In this work we aim to study the effect of these systems from the perspective of the end user as opposed to moderators managing the system. In particular, we look at the effects of deep neural network based moderation systems, which are notoriously difficult to interpret, in comparison to rule based systems. This also introduces issues regarding the perceived legitimacy of platforms, as neural models lack many of the core procedural values required for a platform to be seen as fair, such as due process, transparency, and openness [50]. In some cases, linguistic variation also allows users to bypass automated content moderation systems [5]. Understanding what a model perceives as toxic is integral to evaluating its efficacy. Therefore, a key goal of this work is to provide or outline implicit procedures used by automated toxicity detection models that can bring higher levels of transparency in these systems.
Recent work has shown that even limited transparency and explanations from automated systems can be as effective as explanations from human moderators, which “suggest an opportunity for deploying automated tools at a higher rate for the purpose of providing explanations” [23]. We also hope that, similar to Matias and Mou [35], this platform can serve as a host for experiments in visualizing the predictions of different types of algorithms and the impact of these visualizations on user behaviour.
Many researchers have explored and built both social and technical approaches to identifying, reducing, and combating hateful content online [4, 7, 13, 34, 42, 47]. One of the main solutions this research and these platforms propose is to block, ban, or suspend the message or the user account. Although removing content or banning relevant users who are perceived as toxic by automated models may reduce their impact to some extent, it may also eliminate legitimate or important speech. A number of interventions have been developed in the CSCW/CHI community to combat hateful content and harassment [32]. For instance, Seering et al. [42] designed a system using psychologically “embedded” CAPTCHAs containing stimuli intended to prime positive emotions and mindsets, influencing discussion positively in online forums, while Taylor et al. [51] explored cues that could encourage bystander interventions. Smith et al. [45] raise concerns about conflict between automated moderation systems and human guidelines, studying Wikipedia’s current automated moderation system. Chandrasekharan et al. [6] introduced a sociotechnical moderation system for Reddit called Crossmod to help detect and signal comments that would be removed by moderators. In contrast, our proposed tool Recast aims at influencing discourse more directly at the end-user stage to promote fairness and interpretability. Citron and Norton [8] examined a number of efforts on hate speech and identified three ways of responding to hate speech: (1) removing hateful content, (2) directly rebutting hate speech, and (3) educating and empowering community users. Our proposed tool aligns well with (3), as Recast enables users to see the toxicity levels of their content transparently, offers examples of instances when content is and is not toxic,
and provides model visualization and reasoning. Furthermore, Recast helps users with alternative wording, helping them express similar ideas in non-toxic ways.
Driving the development of many new toxicity detection methods are recent advances in the field of NLP. In particular, the introduction of massive pretrained Transformer [55] models such as BERT [10] has accelerated the state of the art in many downstream tasks like toxic language classification. The learned representations from these models are often used as the input for a relatively simple model which can be trained on a specific task like toxicity classification. This process is very common as it hugely reduces the amount of training and data required to achieve good results. However, this does mean that any potential issues present in BERT (or other pretrained Transformer models) are inherited by the fine-tuned model. Issues include bias [1, 33, 40], as well as a general lack of semantic understanding [11]. Such issues are particularly important when considering notions of toxicity, which can take many forms, some of which are linguistically pleasant but semantically abhorrent. In fact, work has shown that the Perspective API is susceptible to the same adversarial attacks that fool other NLP models [18]. To address this, there has been some progress on building frameworks to automatically mitigate bias in text directly [39]; however, many problems in this space remain open. Despite these issues, deep neural network based models outperform more traditional rule based models in detecting biased language [19], and are increasingly being deployed, thus carrying these flaws into socially-important real-world situations.
In order to express information about the model to non-technical end users, our tool fits generally within the tradition of visual analytics for deep learning explainability [16]. From a visual analytics perspective, various interactive tools have been built for understanding the internal mechanisms of general purpose natural language processing systems. However, (a) there has been little work on understanding the function and impact of toxicity detection systems specifically, and (b) these tools are aimed mainly towards developers. Some tools such as SANVis [37] and exBert [17] allow for interactive exploration of the attention mechanisms in Transformer models. The attention scores associated with each word allow connections between words and emphasis of certain words in context to be highlighted. These approaches take a generally static view of the data, where the tool is viewed as a method to explore existing data.

However, dynamic visualizations that allow for user experimentation can assist in the understanding of a machine learning model [27]. Some tools take a more active approach to explain models through counterfactuals [57]. Others, like Errudite [59], allow users to test their own hypotheses with respect to the true error distribution on the entire dataset. Some techniques attempt to adversarially perturb language input to a model [44, 60] in order to change its classification; other methods use a human-in-the-loop design [28]. In this work, we synthesize these two paradigms by passively visualizing model attention, while also enabling interactivity with AI-guided and human-driven counterfactuals. Importantly, we design our tool in the context of automated moderation systems, prioritizing usability from the perspective of non-specialists.

Finally, tools like the Perspective API also offer limited interpretability, allowing platforms to embed text editing areas with a small widget that notifies users with a binary output (toxic/non-toxic) when their input exceeds a toxicity threshold. It also updates as users edit, and allows easy feedback for users to notify that they think the model was incorrect. This provides some of the benefits of Recast; however, it lacks more extensive visual information to help lead users to specific problems in their text, which maintains the black box nature of the model.
In this section, we motivate the design of the Recast tool through a formative user survey, and formalize a series of design goals from the concerns raised in that survey.
To understand difficulties relating to toxicity moderation, and to outline user needs (especially those related to recourse), we surveyed 100 Amazon Mechanical Turk workers with social media accounts in the United States about their experience surrounding online toxicity and moderation. We asked how often users noticed toxic language on the internet and found that 62% responded noticing toxicity either ‘often’ or ‘always,’ compared to only 6% responding with either ‘rarely’ or ‘never.’ Our survey also highlights the clear effect of racism as an undeniable component, with 16% of non-white users reporting ‘always’ noticing toxicity compared to only 4% of white users (and only 2% for white males). Toxicity is not only noticed but caustic: we asked about the effect toxicity had on users’ lives both online and offline, and found that overall 43% responded that toxic language online had a ‘somewhat negative’ or ‘very negative’ effect on their lives offline. This effect was also influenced by gender, with only 20% of males reporting negative effects while 52% of non-males reported negative effects offline. We find that not only is toxic language pervasive, but it disproportionately and intersectionally harms already underprivileged groups.

Given the groups’ overall clear negative experience of toxicity, we also wanted to understand how they felt about the trade-off between free speech in online spaces and toxic language. In an open ended question, 36 responses were highly supportive of strong moderation to reduce toxicity, with responses along the lines of “I think the tradeoff is well worth it. I am so tired of hearing foul language all the time.”
At the same time, 31 were very skeptical of any moderation and highly supportive of user freedom of speech, responding “I think free speech is important. So people can say what they want even if it is toxic.”
These divided responses highlight the inherent difficulty of balancing the standards of content moderation and freedom of speech. Many neutral responses were additionally skeptical of automated systems in particular. For example, one participant noted that “in many platforms moderation is biased. The automatic tools are not sufficient to remove toxic material from the site. These tools end up removing quality content instead of toxic ones.”
A common complaint among these responses was that addressing the issue of toxicity through simple models would exacerbate existing tensions between moderators and users. To understand the specific problems users had with these automated systems, we then asked about feedback. Among participants who have had a post removed either by a moderator or an automated system, we found that when removed by a human, 52% reported receiving feedback ‘often’ or ‘always,’ while only 36% reported receiving feedback from automated systems.

Finally, to understand which areas users felt need the most improvement, we asked participants how they would improve existing systems for online content moderation, and what features they would want in a tool. Concretely, users noted wanting familiarity and similarity to existing tools for modifying language in spellcheck and auto-complete interfaces, “like auto correct but for language vs. typos,” that could simply “suggest other words.”

This aligned with a desire that any tool should be “something friendly and approachable that isn’t too intrusive or annoying.” Many of the responses were outright hostile to the concept of a tool for reducing toxicity for fear of censorship, meaning that any such tool would be most effective the less visible it is, similar to the notion of nudging in behavioural economics [52]. Stemming from the same well justified fear, users emphasized that such a tool requires “an appeals process, information about why a post is removed, a rejection of a post with advice for fixing it before posting something”; and “the ability to give feedback on the tool since it will almost certainly have failure scenarios. The tool should also be able to work in real time, and have an excellent understanding of English.”

(Survey details: the survey took an average of 6 minutes to complete, with compensation of $0.80, above the US Federal minimum wage. Workers were selected from the Amazon Mechanical Turk pool with filters to ensure they were within the United States, and held Reddit and Twitter accounts to ensure a certain familiarity with online discourse. Respondents had an average age of 34 (standard deviation of 8 years), identified as 66% male, 33% female, and 1% nonbinary, and 75% White, 9% Asian, 8% Black, 2% Latino, 1% Native American, and 5% other/unspecified.)
From our survey, we found that users indirectly recognize that current machine learning systems do not have a perfect understanding of natural language. Thus, a more concrete way to provide feedback is very important for such a system. Finally, one user noted that they would like to see the sources for a model, “guidelines.. lots of written comments of what is deemed toxic,” which helps justify the need for these models and datasets to be open source.
We synthesized the information from the study to identify the main design goals for Recast.
G1 Interpretation.
Users often feel as if they are unsure of the specific guidelines and requirements they are asked to maintain, and may not know what about their language may cause a post to be removed. Therefore it is important to provide explanations to users that are useful not only for a specific comment but more generally for how the classifiers work. Ease of interpretability will enable users to build appropriate mental models for planning how they use language in the future. These interpretations can provide the bedrock of any potential action a user may want to make either on or off of the platform.
G2 User Driven.
In order to make sure users do not feel overly censored, we ensure that no decision about editing text is to be made without the explicit choice of the user, and that a wide variety of options be presented to maximize the capability of the user to say what they mean. This differentiates Recast from fully automatic, end-to-end approaches, which have been presented for reducing bias or toxicity [39]. A trade-off with this design principle concerns users who are determined to use toxic language. However, such users would not use a tool like this in the first place if it prevented their ultimate desired language. In order to understand how real users might use similar tools, and to study how even toxic users interact with moderation models, we prioritize a user-centered design.
G3 Minimalism.
In order to ensure accessibility for end users who are not familiar with complex data visualization paradigms, and to make sure that the tool is not overbearing or irritating, we aim to build a tool that minimizes extraneous views. This also has a trade-off of precluding more advanced or comprehensive user interfaces; for the purposes of this study, however, providing greater accessibility for users (for instance users whose first language is not English) is an important consideration and thus is prioritised.
G4 Easy Feedback.
Anticipating that any model for detecting toxicity will be flawed, it is valuable for users to be able to highlight erroneous classifications easily to ensure they feel they are being heard, and to improve the underlying model when possible.
G5 Accessibility.
To develop a tool that is accessible for users without specialized computational resources, we deploy our tool using lightweight modern web technologies, and place emphasis on ensuring our system runs efficiently for low-resourced users. We also open-source our code to support reproducible research.
Outside of the primary design goals of Recast, there are special considerations we need to give to potential ethical issues. Recast enables users to edit text so that it is no longer detected as toxic by a classification model. However, it is highly possible that the resulting text can in reality remain toxic. Aspects of this work are similar in functionality to work done to generate adversarial examples for text classification [28, 29]. Bad actors could potentially use Recast to pass truly toxic language past existing filters. To avoid these scenarios, we have included controls within Recast that prevent it from being used in such scenarios where it detects explicit hate speech or uneditable/irredeemable toxicity. Many bad actors already have the ability to bypass models through the method of trial and error. The end result of proliferation of a tool like Recast is not to improve the ability for bad actors to ‘game the algorithm,’ as it is already being gamed [5]. These considerations are in addition to standard practice considerations with anonymous data collection and annotations. This research study has been approved by the Institutional Review Board (IRB) at the researchers’ institution.

Recast (Figure 1) is an online interactive tool with the primary focus of allowing users to visualize toxicity within a text, make changes to reduce toxicity, and gain generalizable insights into how toxicity classifier models work. At the same time, it is important to note what Recast is not. Recast is not designed to necessarily change, explain, or interpret anything regarding “real” toxicity, as far as it even can be directly or objectively identified. Rather, Recast allows users access to, interpretation of, and interaction with toxicity detection models. As we have established, the current state of the art of NLP can be linguistically naïve [11]; thus we expect that notions of “true” toxicity and detected toxicity will diverge, especially in the scenario when users try to edit toxic language to comply with these models. Furthermore, it is also important to note that the contribution of Recast is not novel NLP or visualization methods (in fact the Recast architecture is model agnostic), but rather the synthesis of existing tools in addressing specific user issues and then studying the effect that the systems underlying these issues have on online discourse and user experience.
Explanation Method            Human Annotation Overlap    Average Compute Time
Integrated Gradient Based     – ± 0.06                    1880 ms
Attention Based               0.87 ± 0.07                 99 ms

Table 1. Comparing gradient and attention based methods for flagging toxic words in Recast. Evaluations were run on a computer with an Intel i7 2600K and a single NVIDIA GTX 1070.
The primary information Recast expresses is the toxicity classification score of user input. Users can input text and view the toxicity of the overall sentence, along with which words contribute most to the output score. Recast displays a score between 0 and 100, which represents the probability that the model assigns to whether or not the input is toxic. This is shown at the very top of the tool as a bar (Figure 1A). The minimal bar design, in monochrome and removed somewhat from the text, is noticeable enough to provide informational feedback, while subdued enough to not distract from the text itself. The toxicity bar dynamically updates as the user edits the text. This allows users to experiment with their own or suggested edits and get real time feedback for any counterfactual scenarios. This enables both effective exploration to choose the best possible wording and easy iteration to test hypotheses for how the model works. The inclusion of this metric visualization may incentivize a local optimization or ‘hill-climbing’ approach. However, this approach better aligns with the user desire for limited intervention, as a user may be more likely to make an edit if it is a small change to their text rather than a complete rewrite. Furthermore, the reactive visual provides immediate feedback, facilitating experimentation, which has been shown to be effective in building understanding [27].
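For concreteness, a minimal sketch of how such a 0–100 score could be produced with a fine-tuned binary toxicity classifier is shown below. The checkpoint path and the assumption that class index 1 is the “toxic” class are illustrative and not part of Recast’s released code.

```python
# Minimal sketch of producing a 0-100 toxicity score like the one in Figure 1A.
# "path/to/finetuned-toxicity-bert" is a placeholder checkpoint, and index 1 is
# assumed to be the "toxic" class of a binary classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-toxicity-bert")
model = AutoModelForSequenceClassification.from_pretrained("path/to/finetuned-toxicity-bert")
model.eval()

def toxicity_score(text: str) -> float:
    """Return the model's toxicity probability as a 0-100 score."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    prob_toxic = torch.softmax(logits, dim=-1)[0, 1].item()
    return 100.0 * prob_toxic

print(toxicity_score("you are an idiot"))
```

Because the score is a single scalar, the frontend only needs to re-request it after each edit to drive the reactive bar.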
A key component of Recast involves identifying the tokens in a text that are indicative of toxicity, and attributing importance to these tokens. There are two widely-used automated techniques that perform attribution: gradient based explanation and attention based explanation [49, 55]. Both techniques offer a numeric value for each word in an input sentence, where the magnitude of the numeric value corresponds to relative importance in a model’s prediction. However, the capacity for these methods to explain model predictions may differ across tasks [58]. In this subsection, we compare gradient and attention based explanations to human annotations and to each other, selecting an appropriate technique for Recast’s backend.

Although there is much debate as to whether attention is a good proxy for explanation [58], when interpreted carefully, attention can be a rough and weak proxy for explanation [9, 43]. As a precaution, we compared our attention based metric (denoted as attn) to typical gradient/saliency based techniques (denoted as grad). For attn, we computed the average attention score over the last layer’s heads in our Transformer model (before the linear classification layers) using the end/CLS token on the input text. On the other hand, grad was calculated using the integrated gradient approach documented by Sundararajan et al. [49], using the standard gradient operation on the input to the model to evaluate importance. For evaluation purposes, we collected two different metrics: speed of inference, and set overlap, overlap(X, Y) = |X ∩ Y| / min(|X|, |Y|), for flagged words in sentences. In the set overlap metric, X and Y are sets of flagged words from a given sentence.

To effectively compare both techniques, we aligned their differing output ranges, since attn is bounded between [0, 1] and grad between [0, ∞). We manually tuned cutoffs for attention (i.e., attn > 0.2) and found cutoffs for gradient based approaches by collecting the distribution of attn scores, p(x), and grad scores, q(x), over a random 5% subset of our training dataset (7979 instances). We then took the percentile of our 0.2 attn cutoff under p(x) and used the corresponding quantile of q(x) as the grad cutoff, approximately 0.02. Finally, we selected words from a smaller subset (50 instances) to compare attention and gradient based flagging to human annotations.

We conducted two analyses to compare gradient and attention based methods for explainability:

(1) Analyze word overlap and inference speed on a random subset of our training data (7979 instances), comparing grad and attn.

(2) Analyze word overlap on a smaller random subset of our training data (50 instances), comparing grad, attn, and human annotations. The guidelines for this task required annotators to flag all words contributing to toxicity either implicitly or explicitly.

For analysis 1, we found an average overlap(grad, attn) = 0.82 ± 0.02 at a 95% confidence interval, over 7979 instances. For inference times, saliency methods required significantly more time due to the added back-propagation step: we recorded flagging speed per batch, with grad at 1.88 s (mean of 7 runs, 1 loop each) and attn at 98.8 ms (mean of 7 runs, 10 loops each). For analysis 2, the authors, who are familiar with the task setup, annotated 50 random examples manually, highlighting tokens they considered toxic. We recorded average overlaps among grad, attn, and the human annotations; overlap(attn, human) = 0.87 ± 0.07, and overlap(grad, human) was comparable (± 0.06), with all values at 95% confidence intervals. Regardless of flagging technique, we find that each method highlights core toxic elements in text that align with human annotations.

In order to achieve real-time explanations with minimal latency, gradient explanations are prohibitively slow (1.88 s vs 98.8 ms). In our token flagging task, both attention and gradient techniques perform similarly, and have reasonable overlap with human annotations. Therefore, we use attention to explain which words are highly associated with our model’s predictions. We understand that attention, in some contexts, may not explain model predictions. However, in our specific scenario (flagging toxic phrases), attention is both significantly faster and flags similar tokens when compared with human annotation (overlap at 0.82 and 0.87 for tasks 1 and 2, respectively). Because our selected techniques perform similarly, and attn provides improved inference speeds, we utilize attn for this work.
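The sketch below illustrates, under the assumption of a BERT-style sequence classifier loaded as in the earlier snippet, the two building blocks this subsection describes: averaging last-layer attention heads from the [CLS] token to score each token (reading the paper’s end/CLS description as the [CLS] row is our assumption), the 0.2 flagging cutoff, and the set-overlap metric. It is not the authors’ exact implementation.

```python
# Sketch of attention-based word importance and the set-overlap metric.
# The model and tokenizer are assumed to be loaded as in the earlier snippet.
import torch

def attention_scores(text, model, tokenizer):
    """Average last-layer attention from the [CLS] token (index 0) to every token."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    last_layer = out.attentions[-1][0]           # shape: (heads, seq_len, seq_len)
    cls_to_tokens = last_layer[:, 0, :].mean(0)  # average over heads, row of [CLS]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return list(zip(tokens, cls_to_tokens.tolist()))

def flagged_words(scored_tokens, cutoff=0.2):
    """Tokens whose averaged attention exceeds the 0.2 cutoff used in the paper."""
    return {tok for tok, score in scored_tokens if score > cutoff}

def overlap(x: set, y: set) -> float:
    """Set overlap: |X ∩ Y| / min(|X|, |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))
```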
Various visualization concepts have been used to show the relative importance of words and visualize attention, such as highlighting and opacity [54]. However, we utilized an underline on every word, where the opacity of each underline is controlled by the magnitude of attention placed on each word (Figure 1D). This method was chosen as it helped with legibility of the text, which is vital for users understanding differences in textual classifications. Like the classification bar, its design is purposefully simple, mirroring the text interaction technique of underlining to note where editing is required, something end users are familiar with from common text editing software. This accessibility allows Recast to effectively communicate the complexities of toxicity detection models using a visual language users are already fluent in.
Beyond passive visualizations, Recast suggests concrete, actionable edits to text, assisting users in lowering toxicity through alternative wording. The alternative wording feature provides users with options to swap or delete words in a sentence that are responsible for high toxicity scores. A word in the input is highlighted (Figure 1B) to draw particular attention to it when it meets the criteria of:
(1) an attention score greater than 0.2,
(2) Recast can find alternative words with individual toxicity less than 0.4, and
(3) the alternatives have a positive impact on the overall input toxicity.

These thresholds were determined during the analysis to best match the attention highlights to human annotation. The requirement that alternatives have a positive impact both globally (for the whole text) and locally (for the replacement word on its own to be benign) provides a safeguard against malicious use.

When the user hovers over any of these words, suggested substitutions are shown and ranked in a popup (Figure 1F). Selecting one of these alternatives replaces the word, and the new toxicity score is updated. This mode of interaction is easy and intuitive for users due to its similarity to familiar spellcheck or thesaurus tools (motivated by our survey), requires little retyping of edits, and gives options if users cannot immediately think of an alternative word. Furthermore, it displays a range of options, which gives the end user agency in maintaining the original meaning as closely as possible. Finally, beyond the act of making the sentence less toxic, the technique allows users to learn which words tend to be highlighted, and what common synonyms the algorithm tends to suggest. This allows people to learn about the model and use this knowledge while writing future comments.

Fig. 2. Process for generating alternative words.

Figure 2 illustrates how the set of alternative words for a given toxic input word is calculated. Given a word for which we are calculating potential alternatives, we first find its nearest neighbors in a GloVe [20] word embedding space [36]. We limit the nearest neighbor search to 10, balancing the amount of user choice in word options while reducing the cognitive load of choosing among too many [53]. These words match the original word closely in meaning, and provide a solid base of options for users to reword from. Furthermore, these vectors are not dependent on a manually curated list of synonyms and extend to find similar but non-synonym words. Next, we use a BERT language model to find other words that may fit within the context of the word to be replaced. We do this by feeding the original input into the language model with the selected word masked, causing the language model to output a probability distribution of likely words that fit within that context. From these words, we select the 20 most likely. Then, we take the union of these two sets and filter out any words with individual toxicity greater than 0.4, along with words that do not have a positive impact on the overall input toxicity.

Using current state of the art NLP models does not ensure that every option will be a good replacement. To this end, Recast highlights several alternatives so the user will likely find at least one good replacement that they can select. Our controls ensure that no replacement makes the
result worse. This gives the user the most amount of control over the process while still leveraging all of the potential options and power of modern NLP systems.
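A condensed sketch of this single-word pipeline appears below. The library choices (gensim for GloVe vectors, the HuggingFace fill-mask pipeline) and the reuse of the earlier toxicity_score() helper are illustrative assumptions rather than Recast’s actual code; the candidate counts (10 and 20) and the 0.4 / improvement filters follow the text.

```python
# Sketch of the alternative-word pipeline in Figure 2: candidates come from GloVe
# nearest neighbours and from a masked language model, then are filtered by the
# thresholds described above. toxicity_score() is the earlier illustrative helper
# returning values on a 0-100 scale.
import gensim.downloader as api
from transformers import pipeline

glove = api.load("glove-wiki-gigaword-100")        # any GloVe embedding would do
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def alternatives(text: str, word: str, k_glove=10, k_lm=20):
    # 1. Nearest neighbours in embedding space (words close in meaning).
    embed_cands = {w for w, _ in glove.most_similar(word, topn=k_glove)} if word in glove else set()
    # 2. Words the language model finds plausible in this context.
    masked = text.replace(word, fill_mask.tokenizer.mask_token, 1)
    lm_cands = {p["token_str"].strip() for p in fill_mask(masked, top_k=k_lm)}
    # 3. Keep candidates that are individually benign (< 0.4) and that actually
    #    lower the toxicity of the whole input.
    base = toxicity_score(text)
    keep = []
    for cand in embed_cands | lm_cands:
        if toxicity_score(cand) >= 40:              # 0.4 on the 0-1 scale
            continue
        edited = text.replace(word, cand, 1)
        score = toxicity_score(edited)
        if score < base:
            keep.append((cand, score))
    return sorted(keep, key=lambda x: x[1])         # rank by resulting toxicity
```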
In addition to replacing single words, sometimes toxic words come in groups or phrases where the toxicity is not individually attributable to any one of the words; this may require a different editing paradigm for the end user. To support this, Recast allows a user to select a contiguous text phrase, and alternatives are generated for the n-gram of toxic words contained within that phrase. Sets of words are chosen from the same universe of words as for single word replacements through word embedding similarity. Furthermore, the language model masks the entire set of words to be replaced and gets the most likely tuples from the resulting joint distribution. This allows Recast to encode the linguistic coherence of not just each word in context but the whole set of words within context. Finally, sets of words are ranked by the resultant toxicity of the edit on the selection, as before.
Knowing that the classification model is expected to make mistakes, and that a goal is to provide user recourse for handling those mistakes, we have also included an integrated feedback form within the tool. If a user feels the model has made a mistake, there is an included text box below the main input space for comments to submit to the developers. In a deployed system, this would forward complaints on to the relevant platform. This can be used as a means for re-training and improving the model, as well as providing a direct way for users to pressure platforms when models exhibit bias or other issues. By logging inaccuracies highlighted using Recast, end users can concretely identify when models utilize tokens that should not be attributed with a toxic prediction. Furthermore, visual feedback from Recast provides developers and researchers with identifiable sources of errors in their models.
While Recast as a tool is built to be model agnostic (as long as a model uses attention or a similar method), for our evaluation we needed to include a backend implementation of current state of the art toxicity detection models as a useful proxy for deployed systems.
The dataset we used for training the backend model for Recast was sourced from the Kaggle competition run by Google’s Jigsaw [25], which is based on the dataset used by Google’s Perspective API [30]. The Perspective API is one of the most commonly used and openly available content moderation tools; therefore, its underlying dataset was a suitable proxy for Recast’s goal. By benchmarking against this dataset, we can compare our performance directly against that of the Perspective API, and thus be well justified in the representativeness of our model.

We also chose to use this dataset because it was pre-cleaned, openly available, and contains a wide variety of baseline models through the Kaggle competition. The dataset consists of a set of 312,735 comments from Wikipedia’s talk page edits, along with multiple labels that characterise the form of toxicity (toxic, severe toxic, obscene, threat, insult, and identity hate). We chose to only use the toxic label in the dataset for modeling purposes, as the other labels were subsets of toxicity and we wanted a sharper focus.

A noteworthy limitation of this dataset is its focus on explicit hate speech, or speech that directly insults through the use of particular keywords and phrases (like “shut up” and “stupid,” as seen in Figure 1). Implicit hate speech, however, tends to focus on stereotypes, avoiding explicit phrases (e.g.,
“you’re smart for a girl” containing no individually toxic components yet expressing a misogynist meaning). Future work on collecting implicit hate speech is needed to help extend Recast and other toxicity detection systems to support such examples.

Model                              ROC-AUC
Logistic Regression                0.963
Naïve-Bayes SVM                    0.972
LSTM                               0.977
Fine Tuned BERT (Recast)           0.982
Large Ensemble (Kaggle Leader)     0.989

Table 2. Kaggle reported toxicity detection performance, highlighting the fine tuned BERT model used in this work.
To detect toxicity in text, we fine-tuned a state-of-the-art Transformer based model (BERT) that performs reasonably well across various language modeling tasks. Transformer based models rely extensively on self-attention to predict text [55]. Concretely, self-attention allows a model to detect toxicity based on the context of a word. Transformer models work by applying self-attention mechanisms to the input several times, over several layers. A final output is selected by propagating attention across these layers. Although there are a wide range of possible models, such as those proposed in the original 2017 Jigsaw Kaggle competition [25], we decided to utilize Transformer models due to their prevalence in most modern NLP tasks, as Recast aims to be generally useful for current models.

Table 2 summarizes performances across several baseline classifiers using the ROC-AUC metric [12], which is a standard metric in machine learning classification problems and the one reported in the Kaggle leaderboard. The BERT model we used performs on par with the current state of the art Kaggle leaderboard while utilizing a single model as opposed to an extremely opaque ensemble of models. Furthermore, it remains representative of the current trends in NLP.
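A hedged sketch of this fine-tuning setup is shown below. The CSV filename, the column names (which follow the public Kaggle release), and the hyperparameters are illustrative assumptions, not the configuration used to produce Table 2.

```python
# Sketch of fine-tuning a BERT sequence classifier on the Jigsaw toxic-comment
# data and reporting ROC-AUC, the metric used in Table 2.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import roc_auc_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

data = load_dataset("csv", data_files="jigsaw_train.csv")["train"].train_test_split(test_size=0.1)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(batch):
    enc = tokenizer(batch["comment_text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = batch["toxic"]          # only the binary "toxic" label is used
    return enc

data = data.map(preprocess, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    shifted = np.exp(logits - logits.max(axis=1, keepdims=True))   # stable softmax
    probs = shifted[:, 1] / shifted.sum(axis=1)
    return {"roc_auc": roc_auc_score(labels, probs)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxicity-bert", num_train_epochs=2,
                           per_device_train_batch_size=16, evaluation_strategy="epoch"),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```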
All deep learning based models used in our system were implemented in PyTorch [38], a library for building deep neural networks. We also utilized the HuggingFace package for pretrained Transformer models. Although we use Transformers across Recast, we built our frontend to be model agnostic, so the backend model can easily be swapped out without major code changes (provided the replacement model supports generating attention-like explanations for its predictions). Our frontend was written using Svelte for compartmentalizing our code, and D3.js for miscellaneous visual elements. Because Recast’s predictions occur on the backend, our client itself can run on systems with reduced computational power. Recast does not present any novel NLP architectures, techniques, or methods. Instead, it is built upon the current state of the art, with consideration for real time performance. A contribution of this work is the usable tool itself as an implementation designed for this specific use case, and the insights gained from being able to study users interacting with the tool.
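To illustrate the model-agnostic boundary described above, one possible shape of the backend contract is sketched below; the interface and method names are illustrative, not Recast’s actual API. Any backend exposing these three operations could, in principle, be swapped in behind the same frontend.

```python
# Illustrative contract a Recast-style frontend could rely on; any model that can
# provide these three operations is a valid backend. Names are hypothetical.
from typing import List, Protocol, Tuple

class ToxicityBackend(Protocol):
    def score(self, text: str) -> float:
        """Overall toxicity of the text as a 0-100 probability (Figure 1A)."""
        ...

    def token_importance(self, text: str) -> List[Tuple[str, float]]:
        """Per-token attention-like importance used for underline opacity (Figure 1D)."""
        ...

    def alternatives(self, text: str, word: str) -> List[str]:
        """Ranked, less-toxic replacement candidates for a flagged word (Figure 1F)."""
        ...
```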
We conducted two coordinated evaluations of Recast, as outlined in Figure 3, to study how well it addressed our design goals, as well as to study the effect that user interpretability of toxicity detection models might have on online discourse.
Fig. 3. Multiple Evaluation Procedure
In order to evaluate how users would use a tool like Recast, we considered the task of editing toxic comments on social media. This scenario is representative of a situation a user would be presented with when using a tool like Recast: users would be interested in interacting with the model and potentially making changes to their original comment only after their comment is detected as toxic. Since one of our insights from the initial study was that users would prefer a tool to be lightweight, we imagine that Recast would only be deployed in these kinds of editing situations, where a comment is already recognized as toxic, instead of being used every time a user writes from scratch. As a result, we present annotators with potentially toxic comments rather than asking annotators to come up with some on their own. While there may be some difference between users editing provided comments instead of editing their own comments, having a common set of comments for users to edit, and thus comparable resulting outputs, provides a larger benefit in terms of reproducible analysis.
For this evaluation we conducted a within-subject study which compared editing comments using Recast to normal editing without any help from Recast. We recruited 50 users (with a mean age of 38, consisting of 32 self-identified males and 18 self-identified females, all within the United States) from Amazon Mechanical Turk, an online microtasking platform. Users were paid $1.80 per task (matching the United States federal minimum wage).
The task users were given was to edit a set of provided comments scraped from two sources. The Perspective Kaggle dataset [25], consisting of Wikipedia comments manually labeled as toxic, provided a strong source of comments for model training. To supplement our study with comments outside of the Kaggle dataset, we took replies on Twitter scraped on May 9, 2020, found by looking at top trending hashtags in the US. We collected responses that were hidden below all other replies, behind the following filter warning: “Show additional replies, including those that may contain offensive content.” This mixture of sampled comments allowed us to examine both examples within and outside of the training dataset. In the case of Twitter, we provided users with contextual thread information to help them decide how to edit the comments. (Specific comment text for each of these cases used in the user study can be found in the appendix.) For a given editing
task, users were provided a description of the context of the original comment, as well as preceding comments in the cases where they were available.

Users were initially prompted to edit 4 comments using a standard text editor, then shown a video describing the features of Recast. Next, users were asked to edit a second set of four comments using Recast. In order to compare the resulting texts, we randomized the set of comments provided to the Recast enabled and Recast disabled groups. Finally, users were asked to rate on a five-point Likert scale the degree to which they thought the addition of the tool did or did not help them reduce comment toxicity and understand the mechanics of the toxicity detection model. Table 3 shows that strong majorities found the tool to be easy to use and helpful, and provided a good understanding of the model’s heuristics.

Survey Question                  Proportion Agrees    95% CI
The tool is easy to use.         78%                  –

Table 3. User evaluations of Recast. Both “Agree” and “Strongly Agree” are included as agreement.
To further validate these results, account for positivity bias in the responses, and analyze the effectiveness of Recast in user understanding, we asked open-ended questions about what participants learned about the model, to gauge the generalizability of the patterns they learned through the study. We asked: “After using the tool, how would you characterize by what criterion language gets labeled as toxic versus benign?”

Many users noticed the tendency of the model to focus more on specific keywords than overall sentiment, as two users noted:

“For the most part they get labeled by individual words with a negative connotation.”

“I think that it tends to pick keywords that can be considered highly offensive.”
Some users pointed out how keywords extended beyond just directly toxic words to common co-occurrences as well:

“I think that language that is obviously offensive (slurs, etc.) is labeled as toxic, as well as words that, with a high frequency, occur often close to other offensive words to make up larger phrases.”
However, some users noticed the cultural influence and flaws in which words were highlighted:

“I think slang words and curse words are flagged more than negative opinions.”

“I think that, especially slang, gets misconstrued within the tool and they falsely label it as toxic, when in reality it’s not.”
One user also noticed the flaws of the underlying model by experimenting themselves with the tool, as they noticed differences between toxic language, which the model highlights, and toxic meaning, which it does not:

“I think that ‘language’ is a bad way to determine what’s toxic. You can write terrible things in cordial proper language, and also be kind in crass harsh language. In many cases I couldn’t make something not toxic without changing the entire premise, as people were just trying to be rude no matter what.”
They later went on to describe the experimenting they had done:
“I tested ‘I love this motherfucker, I’d take a bullet for him any day of the week’ it comes back as 100% toxic. Saying ‘I genuinely hope you just don’t wake up tomorrow’ is 2%. There’s clearly a flaw in the system.”

These responses showcase some generic takeaways users gained through using Recast:

(1) Current automated moderation models focus mostly on individual words, not higher level meaning.
(2) Words that are considered toxic are sometimes influenced by dialect and slang.

These findings are consistent with how linguists describe modern NLP behaviour [11], showing how
Recast enables nontechnical users to quickly understand, heuristically, how highly complicated language classification models work in practice. Furthermore, Recast provides a canvas for easy experimentation that enables users to effectively find flaws in the system and generate meaningful specific critiques, empowering users to potentially take action where they otherwise would not be able to.
In our second evaluation, we compare the resulting edited text provided by users in study 1 to analyze the effect that editing with or without Recast has on the final comments. This will help us not only understand the effect of Recast on online discourse, but also study, more generally, how users might interact and fine tune their language when using an automated toxicity filter.
To do this, we recruited 1,458 participants from Amazon Mechanical Turk and asked them to compare two edited comments—one generated by a participant in study 1 in the Recast-enabled condition, and another generated by a participant in study 1 in the Recast-disabled condition. These comment pairs were randomly selected from the same prompt for each condition. Three different participants were asked to assess each set of: original comment, Recast-enabled edit, and Recast-disabled edit (anonymized as Edit 1 and Edit 2 randomly). Participants were asked to rate how much they view each comment version as toxic on a five-point scale, and were also asked how well they perceived each edit preserved the general content and intent of the original comment.
We found that both the Recast enabled and disabled cases were statistically indistinguishable with respect to maintaining the original general meaning, both 60% ± 2% of the time. Recast-enabled and Recast-disabled are therefore comparable conditions for examining toxicity, since neither disproportionately changes the input so as to be incomparable. As expected, the original comments, which had already been labeled as toxic either within the Wikipedia dataset or by the Twitter content filter, were mostly considered toxic, though a significant minority of these original comments were also labeled as not-toxic, highlighting how people perceive toxicity as non-binary. As shown in Figure 5, the edits made with Recast disabled were generally classified as less toxic than comments made with Recast enabled.

Figure 4 showcases the joint distributions of the original comment toxicity label in each of the edit conditions. A closer look at the joint distribution suggests that in the Recast enabled case, the labeled toxicity is more highly correlated with the original toxicity when compared to the disabled case (Kendall’s Tau [26] of τ = 0.15, statistically significant, compared to τ = −0.03, not significant, in the disabled case).
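For reference, the rank correlation reported here can be computed as in the following minimal example; the paired ratings below are placeholder data, not values from our study.

```python
# Minimal illustration of the Kendall rank correlation used above, assuming
# paired lists of ordinal toxicity ratings (original vs. edited comment).
from scipy.stats import kendalltau

original_ratings = [5, 4, 4, 2, 5, 3, 1]   # hypothetical ratings of original comments
edited_ratings   = [4, 4, 3, 2, 5, 2, 1]   # hypothetical ratings of the corresponding edits

tau, p_value = kendalltau(original_ratings, edited_ratings)
print(f"tau = {tau:.2f}, p = {p_value:.3f}")
```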
Fig. 4. Joint distributions of enabled and disabled edits vs original comment toxicity. Note (A), which showcases the upper diagonal of the disabled case, where the resulting toxicity is higher than the original toxicity. This region is more highly populated than in the enabled case, showing that there is a higher risk of increasing toxicity when writing edits without Recast. However, (B) showcases that without Recast, even high toxicity comments are often reduced. Overall, the disabled case shows that without a tool the resultant toxicity is independent of the original. (C) highlights the opposite effect in the enabled case, where the strong representation along the diagonal shows that the resulting toxicity of edits generated using Recast is more likely to be similar to the original toxicity, which is a benefit in that there are fewer cases where toxicity is increased in the upper diagonal, but a cost in the lower effectiveness of reducing toxicity in the lower diagonal.

Fig. 5. Distribution of toxic labels (either ‘Agree’ or ‘Strongly Agree’ with the statement that a comment is toxic for human annotation, or classification by the model) for unedited comments, comments edited without Recast, and comments edited with Recast. Error bars show the 95% binomial proportion confidence interval under the asymptotic normal approximation. We find that Recast does produce optimal comments according to the model. However, we find that the model systematically under reports toxicity among edited comments, and that model optimized comments are labeled as on average more likely to be toxic by human annotators.
Finally, we looked at the difference between the human annotations of toxicity and the model classifications of these edits. We reclassified the output of each edit with our fine-tuned toxicity classification model and compared the resulting classifications to the corresponding human labels. As models become more impactful gatekeepers of what language is and is not allowed, any heuristic that works better for the model will be structurally incentivized and potentially become more common. By examining the difference between moderation using human-determined toxicity and moderation using model-determined toxicity, we can hypothesize future directions of online discourse as such models become more prevalent.

In Figure 5 we see that the Recast-enabled comments remained classified as toxic by the model only 2.… ± …% of the time, compared with … ± ….6% of the time for edits made without Recast. At the same time, edits made using Recast do not reliably decrease the human-annotated toxicity of comments, especially for comments with already high toxicity that need the most editing. However, edited comments with originally high toxicity are still classified as toxic by the detection model only a quarter as often as the same comments edited without Recast.
This analysis shows that:
(1) Recast was highly effective at helping users reduce toxicity as detected by the model, but not as effective at reducing human-annotated toxicity.
(2) Therefore, when language is optimized for the model (which is what the deployment of these models implicitly incentivizes), the model ceases to be a good judge of toxicity as determined by human annotators. Language that is less toxic to the model can be more toxic to humans. "When a measure becomes a target, it ceases to be a good measure" [48].
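The reclassification step described above can be sketched as follows. Here a publicly available toxicity classifier from the Hugging Face hub stands in for our fine-tuned model, and the comments, human labels, label name, and 0.5 threshold are illustrative assumptions rather than our experimental setup.

```python
# Sketch of re-scoring edited comments with a toxicity classifier and
# comparing against human labels. Model, comments, labels, and threshold
# are placeholders; our study used our own fine-tuned BERT classifier.
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

edited_comments = [
    "You clearly have not read the sources on this page.",
    "Please stop reverting edits without discussing them first.",
]
human_labeled_toxic = [True, False]  # hypothetical annotator majority votes

model_labeled_toxic = []
for comment in edited_comments:
    result = classifier(comment)[0]  # top label and its score
    # Label names depend on the chosen checkpoint; 'toxic' is assumed here.
    model_labeled_toxic.append(result["label"] == "toxic" and result["score"] >= 0.5)

disagreements = sum(m != h for m, h in zip(model_labeled_toxic, human_labeled_toxic))
print(f"Model and annotators disagree on {disagreements} of {len(edited_comments)} edits")
```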
When discussing the use of Recast as a tool for reducing toxicity, we have shown that there are two, potentially competing, meanings of the task. There is the underlying notion of toxicity as language with an adverse effect on people, and there is toxicity as the output of the models used to moderate platforms. The development of models like the Perspective API is predicated on the idea that these two concepts are, if not the same, at least asymptotically close as models improve. However, our work has shown that when users have direct access to toxicity classification models, language optimized to match the model does not also minimize human-labeled toxicity.

Recast is clearly effective at allowing users to reduce the model toxicity of their comments, and does this with good ease of use and accessibility while maintaining the original meaning. Recast also provides a path of least resistance between language that is classified as toxic and language that is not, and then gives users the power to choose how to use that knowledge. This is useful for a variety of people. Users who are not fluent in English may inadvertently say things considered toxic without realizing it, and Recast provides a frictionless way for these users to make changes and learn what is acceptable.

On the other hand, some users are better informed than the model when identifying toxic language. We have highlighted how the implicit definitions of toxicity used by models are fundamentally different from what humans consider toxic; toxicity is only a meaningful concept insofar as it has an effect on people, not computers. As users pointed out in subsubsection 5.1.4, slang or dialect may be misclassified as toxic. Recast allows these users both to circumvent potentially unjust or biased models and to raise awareness of these issues through explained examples.

While Recast does not appear to reduce human-labeled toxicity, because users optimize for the model output, this further emphasizes the need for user recourse and model oversight in this space. Recast is an initial effort toward this goal. However, we recognize that substantial work is needed on the developer side of these platforms, and potentially on the policy side, for the recourse that Recast provides to be impactful.
Recast was also designed to allow exploration and visualization of toxicity models in order to help users understand them and to provide a degree of transparency. A large majority of study participants in our evaluation reported that Recast helped them understand the toxicity model (Table 3), and many users were able to produce insightful comments about the patterns presented by the model through Recast. This is a significant component of the recourse that Recast helps facilitate. When toxicity classification models are used to police language, users are subject to often biased rules that they may not even know are being applied. Providing actionable recourse is predicated on those affected by such systems understanding how they are affected, which is nearly impossible when the system is a black-box algorithm. Recast aims to break into that black box and allow end-users and non-experts to see these rules as they are applied.
As machine learning systems are more often deployed to manage moderation online, users will be required to tune their language to match the standards set by these models. This will occur regardless of the availability of tools like Recast, as only language meeting these models' standards will remain visible rather than filtered (creating a form of survivorship bias). By introducing Recast as a way to more directly optimize for these models, we can study the long-term evolutionary effect of misapplied filters on future online discourse. Our evaluation quantitatively highlights how the standards required to succeed with a model diverge from human-perceived toxicity; as model-based standards become more prevalent, toxicity according to human standards may in fact increase. This is a troubling trend in our large-scale quantitative evaluation that warrants further study.
The design of the Recast interface is also meant to generalize as models improve. Based on prior research in the toxicity detection space, Recast uses the state-of-the-art BERT model to estimate the degree of toxicity in messages and suggest alternative wordings. Despite using a specific type of toxicity detection model, Recast is agnostic to BERT specifically and can easily be combined with other machine learning models. Similarly, though the alternative wording suggestion component currently relies on Word2Vec, it serves as a generic framework and is compatible with other embeddings or techniques that generate as broad a space of options for users as possible (a minimal sketch of this embedding-agnostic suggestion step is given below). Recast is also agnostic to model explanation techniques, provided explanations are based on individual words or phrases within a text. Because of our justification in subsection 4.2, we expect gradient-based model explanations to yield similar results, since they flag similar tokens. Finally, Recast will enable future work to validate the effectiveness of new explainability techniques for model-assisted intervention in content moderation.

With respect to toxicity detection models themselves, our work highlights the numerous challenges associated with automated moderation systems. For example, hateful content may be expressed in multiple ways, e.g., sarcasm, irony, or coded text. Users may even use hateful words or phrases to refer to themselves. Instead of investigating different forms of hateful content, we work with a large-scale benchmark corpus with a pre-defined set of toxicity labels. Recast can be further extended to handle various formats of toxic speech. Furthermore, Recast could allow users to interact with models and take various definitions of toxicity to their limits by optimizing their language. As such, Recast may provide a useful backbone for the study of different notions of toxicity.

Our evaluation of Recast was mainly conducted on Amazon Mechanical Turk with annotators in a lab-like environment. As a result, we could not assess the long-term effects introduced by Recast. Future work could build upon our research to investigate whether users would be likely to use tools like Recast in their daily interactions on different online platforms, and how Recast's involvement affects users' subsequent participation.
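The embedding-agnostic alternative-wording step mentioned above can be sketched as follows. This is a minimal illustration, not Recast's implementation: the gensim Word2Vec vectors, the simple word-replacement strategy, and the toy toxicity_score heuristic standing in for a paired classifier are all assumptions made for the example.

```python
# Sketch of an embedding-agnostic alternative-wording step: propose nearest
# neighbours of a flagged word, then keep candidates that lower the toxicity
# score of the full text. All pieces here are illustrative placeholders.
import gensim.downloader as api

word_vectors = api.load("word2vec-google-news-300")  # any KeyedVectors would work

def toxicity_score(text: str) -> float:
    """Toy stand-in for the paired toxicity model (e.g., a fine-tuned BERT)."""
    toxic_words = {"stupid", "idiot", "shut"}
    tokens = text.lower().split()
    return sum(t.strip(".,!?") in toxic_words for t in tokens) / max(len(tokens), 1)

def suggest_alternatives(text: str, flagged_word: str, top_n: int = 10):
    baseline = toxicity_score(text)
    candidates = word_vectors.most_similar(flagged_word, topn=top_n)
    suggestions = []
    for candidate, _similarity in candidates:
        rewritten = text.replace(flagged_word, candidate.replace("_", " "))
        score = toxicity_score(rewritten)
        if score < baseline:  # keep only toxicity-reducing swaps
            suggestions.append((candidate, score))
    return sorted(suggestions, key=lambda pair: pair[1])

print(suggest_alternatives("you are a stupid kid", "stupid"))
```

In a deployed setting, the placeholder scoring function would be replaced by the actual classifier the tool is paired with, and candidates could additionally be filtered for fluency or meaning preservation.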
A related area of future work is building implementations of Recast that can run in the browser. We actively made decisions to ensure Recast is lightweight and can run without significant computational resources on the front-end. By making the code open source, we open an avenue for future work to expand the functionality of Recast into a browser extension. This would both help validate the results of the studies we have run, by embedding Recast in a more realistic scenario, and make Recast accessible to the users who may benefit from it.

However, we caution future researchers to carefully weigh the ethical implications of widely deploying Recast, as such functionality may be highly useful for users working in good faith but potentially harmful if used by bad actors. These risks are inherent in any functionality that helps users navigate these systems, as explained in subsection 3.3. Our study finds that there is an important distinction between two notions of toxicity: (1) language detected by a model as toxic, and (2) language that has adverse effects on real people. An inherent risk of visual analytics tools is that they can only optimize for (1); while we may hope that (1) better aligns with (2) in the future, we find that currently it does not. Thus, while Recast is effective at helping users acting in good faith to reduce (1), its inability to consistently reduce (2) elucidates flaws in current NLP models rather than in the specific design of Recast. Finally, an important limitation of any such tool is that it requires users to want to lower toxicity, which of course is often not the case. However, explicitly handling malicious users is outside the scope of this work; future work studying when users act maliciously, and how to better design human-AI interfaces to de-escalate toxic behaviour before suggesting alternatives, may yield better systems for automated content moderation.
In this work, we identified some key problems with the proliferation of automatic toxicity detection models, and introduced an interactive tool, Recast, to address them. Recast provides users the ability to interact with toxicity detection models and visualize how they work. Through these interactions, users are able to make actionable changes to their language in order to reduce toxicity, while gaining generalizable insights about toxicity models. Through multiple large-scale user evaluations, we showed the effectiveness of Recast in helping users edit text to decrease model-defined toxicity while providing interpretable explanations to users. At the same time, we highlighted the pitfalls of using toxicity detection models for moderation, as toxicity defined by the model differed from human-labeled toxicity. We hope Recast can help users overcome the challenges of automatic moderation, and further the study of methods to empower users in navigating platforms increasingly governed by machine learning systems.
ACKNOWLEDGEMENTS
We would like to thank the anonymous reviewers for their helpful comments. We also thank Dr. Joseph Seering for their feedback on this work. This work was supported in part by NSF grants IIS-1563816 and CNS-1704701; DARPA (HR00112030001); and gifts from Facebook, Intel, NVIDIA, Bosch, Google, Symantec, Yahoo! Labs, eBay, and Amazon.
REFERENCES
[1] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. arXiv:cs.CL/1607.06520 [2] Catherine Buni and Soraya Chemaly. 2016. The secret rules of the internet. [3] Pete Burnap and Matthew L Williams. 2015. Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making.
Policy & Internet
7, 2 (2015), 223–242.[4] Jie Cai and Donghee Yvette Wohn. 2019. What Are Effective Strategies of Handling Harassment on Twitch? Users’Perspectives. In
Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social
Computing (Austin, TX, USA) (CSCW ’19) . Association for Computing Machinery, New York, NY, USA, 166–170. https://doi.org/10.1145/3311957.3359478 [5] Stevie Chancellor, Jessica Annette Pater, Trustin Clear, Eric Gilbert, and Munmun De Choudhury. 2016.
Proceedings of the 19thACM Conference on Computer-Supported Cooperative Work & Social Computing (San Francisco, California, USA) (CSCW’16) . Association for Computing Machinery, New York, NY, USA, 1201–1213. https://doi.org/10.1145/2818048.2819963 [6] Eshwar Chandrasekharan, Chaitrali Gandhi, Matthew Wortley Mustelier, and Eric Gilbert. 2019. Crossmod: A Cross-Community Learning-Based System to Assist Reddit Moderators.
Proc. ACM Hum.-Comput. Interact.
3, CSCW, Article174 (Nov. 2019), 30 pages. https://doi.org/10.1145/3359276 [7] Despoina Chatzakou, Nicolas Kourtellis, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, and AthenaVakali. 2017. Mean Birds: Detecting Aggression and Bullying on Twitter. arXiv:cs.CY/1702.06877[8] Danielle Keats Citron and Helen Norton. 2011. Intermediaries and hate speech: Fostering digital citizenship for ourinformation age.
BUL Rev.
91 (2011), 1435.[9] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look at? AnAnalysis of BERT’s Attention.
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting NeuralNetworks for NLP (2019). https://doi.org/10.18653/v1/w19-4828 [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectionaltransformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).[11] Allyson Ettinger. 2019. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for languagemodels. arXiv:cs.CL/1907.13528[12] Tom Fawcett. 2006. An introduction to ROC analysis.
Pattern Recognition Letters
27, 8 (2006), 861 – 874. https://doi.org/10.1016/j.patrec.2005.10.010
ROC Analysis in Pattern Recognition.[13] Menczer Filippo, R. Fulper, E. Ferrara, Y. Ahn, A. Flammini, B. Lewis, and K. Rowe. 2015. Misogynistic Language onTwitter and Sexual Violence.[14] R. Stuart Geiger and David Ribes. 2010. The Work of Sustaining Order in Wikipedia: The Banning of a Vandal. In
Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (Savannah, Georgia, USA) (CSCW ’10) .Association for Computing Machinery, New York, NY, USA, 117–126. https://doi.org/10.1145/1718918.1718941 [15] Tarleton Gillespie. 2018.
Custodians of the Internet: platforms, content moderation, and the hidden decisions that shapesocial media . Yale University Press.[16] Fred Hohman, Minsuk Kahng, Robert Pienta, and Duen Horng Chau. 2018. Visual Analytics in Deep Learning: AnInterrogative Survey for the Next Frontiers. arXiv:cs.HC/1801.06889[17] Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. 2019. exBERT: A Visual Analysis Tool to ExploreLearned Representations in Transformers Models. arXiv:cs.CL/1910.05276[18] Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google’s Perspective APIBuilt for Detecting Toxic Comments. arXiv:cs.LG/1702.08138[19] Christoph Hube and Besnik Fetahu. 2019. Neural Based Statement Classification for Biased Language. In
Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM '19). Association for Computing Machinery, New York, NY, USA, 195–203. https://doi.org/10.1145/3289600.3291018 [20] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In
Conference on Empirical Methods in Natural Language Processing . Citeseer.[21] Shagun Jhaver, Darren Scott Appling, Eric Gilbert, and Amy Bruckman. 2019. " Did You Suspect the Post Would beRemoved?" Understanding User Reactions to Content Removals on Reddit.
Proceedings of the ACM on human-computerinteraction
3, CSCW (2019), 1–33.[22] Shagun Jhaver, Iris Birman, Eric Gilbert, and Amy Bruckman. 2019. Human-Machine Collaboration for ContentRegulation: The Case of Reddit Automoderator.
ACM Trans. Comput.-Hum. Interact.
26, 5, Article 31 (July 2019),35 pages. https://doi.org/10.1145/3338243 [23] Shagun Jhaver, Amy Bruckman, and Eric Gilbert. 2019. Does Transparency in Moderation Really Matter?: UserBehavior After Content Removal Explanations on Reddit.
Proc. ACM Hum.-Comput. Interact.
3, Article 150 (2019). https://doi.org/10.1145/3359252 [24] Shagun Jhaver, Sucheta Ghoshal, Amy Bruckman, and Eric Gilbert. 2018. Online Harassment and Content Moderation:The Case of Blocklists.
ACM Trans. Comput.-Hum. Interact.
25, 2, Article 12 (March 2018), 33 pages. https://doi.org/10.1145/3185593 [25] Kaggle. 2017. Jigsaw Toxicity Dataset. https://tinyurl.com/y3fbco5b [26] M. G. Kendall. 1945. The Treatment Of Ties In Ranking Problems.
Biometrika
33, 3 (1945), 239–251. https://doi.org/10.1093/biomet/33.3.239
[27] Vivian Lai, Han Liu, and Chenhao Tan. 2020. "Why is 'Chicago' deceptive?" Towards Building Model-Driven Tutorials for Humans. In
Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems . 1–13.[28] Brandon Laughlin, Christopher Collins, Karthik Sankaranarayanan, and Khalil El-Khatib. 2019. A Visual AnalyticsFramework for Adversarial Text Generation. arXiv:cs.HC/1909.11202[29] Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2019. TextBugger: Generating Adversarial Text AgainstReal-world Applications.
Proceedings 2019 Network and Distributed System Security Symposium (2019). https://doi.org/10.14722/ndss.2019.23138 [30] Google LLC. 2017. Perspective API. .[31] Kiel Long, John Vines, Selina Sutton, Phillip Brooker, Tom Feltwell, Ben Kirman, Julie Barnett, and Shaun Lawson.2017. “Could You Define That in Bot Terms”? Requesting, Creating and Using Bots on Reddit. In
Proceedings of the 2017CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’17) . Association for ComputingMachinery, New York, NY, USA, 3488–3500. https://doi.org/10.1145/3025453.3025830 [32] Kaitlin Mahar, Amy X Zhang, and David Karger. 2018. Squadbox: A tool to combat email harassment using friendsourcedmoderation. In
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems . 1–13.[33] Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. 2019. Black is to Criminal as Caucasian isto Police: Detecting and Removing Multiclass Bias in Word Embeddings. In
Proceedings of the 2019 Conference ofthe North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume1 (Long and Short Papers) . Association for Computational Linguistics, Minneapolis, Minnesota, 615–621. https://doi.org/10.18653/v1/N19-1062 [34] Binny Mathew, Punyajoy Saha, Hardik Tharad, Subham Rajgaria, Prajwal Singhania, Suman Kalyan Maity, PawanGoyal, and Animesh Mukherjee. 2019. Thou shalt not hate: Countering online hate speech. In
Proceedings of theInternational AAAI Conference on Web and Social Media , Vol. 13. 369–380.[35] J. Nathan Matias and Merry Mou. 2018. CivilServant: Community-Led Experiments in Platform Governance. In
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI ’18) .Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3173574.3173583 [36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Wordsand Phrases and Their Compositionality. In
Proceedings of the 26th International Conference on Neural InformationProcessing Systems - Volume 2 (Lake Tahoe, Nevada) (NIPS’13) . Curran Associates Inc., Red Hook, NY, USA, 3111–3119.[37] Cheonbok Park, Inyoup Na, Yongjang Jo, Sungbok Shin, Jaehyo Yoo, Bum Chul Kwon, Jian Zhao, HyungjongNoh, Yeonsoo Lee, and Jaegul Choo. 2019. SANVis: Visual Analytics for Understanding Self-Attention Networks.arXiv:cs.CL/1909.09595[38] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, ZemingLin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison,Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: AnImperative Style, High-Performance Deep Learning Library. In
Advances in Neural Information Processing Systems32 , H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc.,8024–8035.[39] Reid Pryzant, Richard Diehl Martinez, Nathan Dass, Sadao Kurohashi, Dan Jurafsky, and Diyi Yang. 2019. AutomaticallyNeutralizing Subjective Bias in Text. arXiv:cs.CL/1911.09709[40] Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The Risk of Racial Bias in Hate SpeechDetection. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics . Association forComputational Linguistics, Florence, Italy, 1668–1678. https://doi.org/10.18653/v1/P19-1163 [41] Joseph Seering. 2020. Reconsidering Community Self-Moderation: the Role of Research in Supporting Community-Based Models for Online Content Moderation.
Proc. ACM Hum.-Comput. Interact.
3, CSCW, Article 107 (Oct. 2020),28 pages. https://doi.org/10.1145/3415178 [42] Joseph Seering, Tianmi Fang, Luca Damasco, Mianhong’Cherie’ Chen, Likang Sun, and Geoff Kaufman. 2019. DesigningUser Interface Elements to Improve the Quality and Civility of Discourse in Online Commenting Behaviors. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems . 1–14.[43] Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable?. In
Proceedings of the 57th Annual Meeting ofthe Association for Computational Linguistics . Association for Computational Linguistics, Florence, Italy, 2931–2951. https://doi.org/10.18653/v1/P19-1282 [44] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2019. How can we fool LIME andSHAP? Adversarial Attacks on Post hoc Explanation Methods. arXiv:cs.LG/1911.02508[45] C. Estelle Smith, Bowen Yu, Anjali Srivastava, Aaron Halfaker, Loren Terveen, and Haiyi Zhu. 2020.
Keeping Communityin the Loop: Understanding Wikipedia Stakeholder Values for Machine Learning-Based Systems . Association for ComputingMachinery, New York, NY, USA, 1–14. https://doi.org/10.1145/3313831.3376783
[46] Kit Smith. 2019. 53 Incredible Facebook Statistics and Facts. [47] Anna Louise Strachan. 2014. Interventions to counter hate speech.
GSDRC Applied Research Services
23 (2014). [48] M. Strathern. 1997. 'Improving ratings': audit in the British University system. European Review.
International Communication Gazette 80, 4 (2018), 385–400. [51] Samuel Hardman Taylor, Dominic DiFranzo, Yoon Hyung Choi, Shruti Sannon, and Natalya N. Bazarova. 2019. Accountability and Empathy by Design: Encouraging Bystander Intervention to Cyberbullying on Social Media.
Proc.ACM Hum.-Comput. Interact.
3, CSCW, Article 118 (Nov. 2019), 26 pages. https://doi.org/10.1145/3359220 [52] Richard H Thaler and Cass R Sunstein. 2009.
Nudge: Improving decisions about health, wealth, and happiness . Penguin.[53] Edward R. Tufte. 2018.
The visual display of quantitative information . Graphics Press.[54] Hernan Valdivieso, Denis Parra, Andres Carvallo, Gabriel Rada, Katrien Verbert, and Tobias Schreck. 2019. Analyzingthe Design Space for Visualizing Neural Attention in Text Classification. https://tinyurl.com/yzt866vb [55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and IlliaPolosukhin. 2017. Attention is all you need. In
Advances in neural information processing systems . 5998–6008.[56] Suresh Venkatasubramanian and Mark Alfano. 2020. The philosophical basis of algorithmic recourse. (2020).[57] James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda Viegas, and Jimbo Wilson. 2019.The What-If Tool: Interactive Probing of Machine Learning Models.
IEEE Transactions on Visualization and ComputerGraphics (2019), 1–1. https://doi.org/10.1109/tvcg.2019.2934619 [58] Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. arXiv:cs.CL/1908.04626[59] Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2019. Errudite: Scalable, Reproducible, andTestable Error Analysis. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 747–763. https://doi.org/10.18653/v1/P19-1073 [60] Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2019. Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey. arXiv:cs.CL/1901.06796
A APPENDIX
A.1 Selected Toxic Comments from Perspective API Dataset
The following comments were used as the representative comments from the Perspective API Kaggle dataset that users edited in our evaluations. We also include a short description to help users contextualize each comment. These contexts are artificial, as the dataset itself does not contain any actual context information.
(1) Context: Consider that you are an experienced editor on Wikipedia, and two new users are attempting to change some edits you have made that you know are correct; however, they persist. You are exhausted and lash out, perhaps as in this real reply:
Comment: "Would you both shut up, you don't run wikipedia, especially a stupid kid."
(2) Context: Consider again that, as an experienced editor, you see some incorrect edits attempting to change a historical page about an obscure medieval battle. They appear to have even the most basic facts about the battle wrong, and, frustrated, you may reply:
Comment: "Hey why you are spreading misconceptions and trying to spread false information to the people. You not even know who won the battle and who lost. you are the cheapest person on the earth. Bloody ignorant fool. Check sources before spreading rumors"
(3) Context: In the process of using the website there are a myriad of technical malfunctions preventing you from doing important work, for which you may have a time constraint, and as a result you may post in a support thread:
Comment: "HELLO HOW DO I GET SOMEBODY TO FIX SHIT AROUND HERE?"
(4) Context: While editing a piece of minutiae on a highly visited page, there is a spirited debate on whether to include an additional sentence, which may seem redundant to some in explaining the overall point but to you seems necessary; however, every time you add this sentence another user removes it. In the discussion thread, in reply to this user, you may post:
Comment: "It is not redundant. You are redundant."
(5) Context: Finally, in a thread with a fellow editor who is having similar difficulties as you, you attempt to commiserate with the post:
Comment: "I know how frustrated you are right now. Stupidity in this place has no limits. Someone with a brain cell or two will eventually show up and clean this mess. Meanwhile hang in there."
A.2 Selected Toxic Comments from Twitter
The following comments were used within the user studies to provide examples from outside of the Kaggle dataset. We found these by locating the top posts on the most trending topics on Twitter in May 2020, focusing in particular on sports- and politics-related topics known for high levels of toxicity, and selecting replies that had been automatically flagged by Twitter's own filtering system. Because we had access to the real context of these replies, we also include the full thread when relevant and describe the basic context of each tweet. When a thread is displayed, the final reply is the one users were given to edit.
(1) Context: In this thread discussing the fatal shooting of Ahmaud Arbery and the delay in finding suspects, User A suggests racial bias while User B disagrees. The final reply in this thread was filtered as toxic, so please edit it from the perspective of User B.
User A: Why did it take too long to arrest two culprits? Cuz they are white and the victim is black and this is America. So yea ... so infuriated
User B: It's actually because of the Georgia laws but go on with your fake outcry.
User A: "gEoRgiA LaWs" ...as if 2 black men would get away with murdering a white man thats minding his business. There was clearly no Stand Your Ground defense. Outright execution
User B: I guarantee you didn't do any research before opening your ignorant trap. But go off girl you definitely doing it
(2) Context: After the announcement of the matchups for next year's NFL regular season, a fan of the Dallas Cowboys, User A, talks trash about their rival, the Philadelphia Eagles, when a fan of the Eagles, User B, attempts to reply but is filtered:
User A:
User B: Keep dreaming scumbag
(3) Context: Liverpool Football Club announced their intention to terminate their contract with player Loris Karius, who is replaced by a different player, Adrián. User A discusses his view that he liked the play of Karius better, while User B disagrees in a very typical British English dialect that is filtered as potentially toxic.
User A: I can't be the only one who rates Karius over Adrian?
User B: Lol what rubbish ! Yes I think you're the only one mate
(4) Context: Elon Musk, billionaire CEO of the electric car company Tesla, is suing the county where a Tesla factory is located, because local COVID-19 restrictions prevented the factory from reopening. User A is supportive of this move, while User B is critical of it and has their response filtered.
Musk: Tesla is filing a lawsuit against Alameda County immediately. The unelected & ignorant "Interim Health Officer" of Alameda is acting contrary to the Governor, the President, our Constitutional freedoms & just plain common sense!
User A: Anything we can do to help? Does reaching out to politicians help in any way?
Musk: Yes
User B: Can you get over yourself for five fucking seconds? I'm a fan of your pursuit but you have been a fucking douchebag during this whole pandemic. Go invent something useful, like an actual engine that could get us to Mars in a reasonable amount of time.
A.3 Study 1
Figure 6 shows an example thread taken from Twitter. Users were presented with four such threads, drawn from either Twitter or Wikipedia, before being given access to Recast, and then given another four threads after. The order of threads, and which comments were provided in each phase of the study, was randomized.
A.4 Study 2
Figure 7 shows the user interface for the labeling task. Users were provided the same context as users in study 1, then given an original comment and two edited versions (corresponding to an edit made with or without Recast), and asked to rate the toxicity of each edit as well as how well it preserves the meaning of the original.
Received October 2020; revised January 2021; accepted January 2021
Fig. 6. Example of a thread provided in study 1. There is a brief explanation of context, followed by a visualization of the comment thread culminating in the comment participants are meant to edit. This comment can then easily be loaded into the editing environment.

Fig. 7. Labeling task in Evaluation 2. Participants are presented with the same context provided to participants in study 1, as well as the original comment, its Recast-enabled edit, and its Recast-disabled edit (anonymized as Edit 1 and Edit 2). They are then asked to rate the edits as well as the original comment on toxicity, and the edits on their consistency in meaning with the original.