From Toxicity in Online Comments to Incivility in American News: Proceed with Caution
Anushree Hede, Oshin Agarwal, Linda Lu, Diana C. Mutz, Ani Nenkova
University of Pennsylvania
{anuhede, oagarwal, lulinda, nenkova}@seas.upenn.edu, [email protected]

Abstract
The ability to quantify incivility online, in news and in congressional debates, is of great interest to political scientists. Computational tools for detecting online incivility for English are now fairly accessible and could potentially be applied more broadly. We test the Jigsaw Perspective API for its ability to detect the degree of incivility on a corpus that we developed, consisting of manual annotations of civility in American news. We demonstrate that toxicity models, as exemplified by Perspective, are inadequate for the analysis of incivility in news. We carry out an error analysis that points to the need to develop methods to remove spurious correlations between words often mentioned in the news, especially identity descriptors, and incivility. Without such improvements, applying Perspective or similar models to news is likely to lead to wrong conclusions that are not aligned with the human perception of incivility.
Surveys of public opinion report that most Americans think that the tone and nature of political debate in this country have become more negative and less respectful, and that heated rhetoric by politicians raises the risk of violence (Pew Research Center, 2019). These observations motivate the need to study (in)civility in political discourse in all spheres of interaction, including online (Ziegele et al., 2018; Jaidka et al., 2019), in congressional debates (Uslaner, 2000), and as presented in news (Meltzer, 2015; Rowe, 2015). Accurate automated means for coding incivility could facilitate this research, and political scientists have already turned to off-the-shelf computational tools for studying civility (Frimer and Skitka, 2018; Jaidka et al., 2019; Theocharis et al., 2020).

Computational tools, however, have been developed for different purposes, focusing on detecting language in online forums that violates community norms. The goal of these applications is to support human moderators by promptly focusing their attention on likely problematic posts. When studying civility in political discourse, it is primarily of interest to characterize the overall civility of interactions in a given source (i.e., news programs) or domain (i.e., congressional debates), as an average over a period of interest. Applying off-the-shelf tools for toxicity detection is appealingly convenient, but such use has not been validated for any domain, while uses in support of moderation efforts have been validated only for online comments.

We examine the feasibility of quantifying incivility in the news via the Jigsaw Perspective API, which has been trained on over a million online comments rated for toxicity and deployed in several scenarios to support moderator effort online. We collect human judgments of the (in)civility in one month's worth of three American news programs. We show that while people perceive significant differences between the three programs, Perspective cannot reliably distinguish between the levels of incivility as manifested in these news sources.

We then turn to diagnosing the reasons for Perspective's failure. Incivility is more subtle and nuanced than toxicity, which includes identity slurs, profanity, and threats of violence, along with other forms of unacceptable language. In the range of civil to borderline-civil human judgments, Perspective gives noisy predictions that are not indicative of the differences in civility perceived by people. This finding alone suggests that averaging Perspective scores to characterize a source is unlikely to yield meaningful results. To pinpoint some of the sources of the noise in predictions, we characterize individual words as likely triggers of errors in Perspective, or sub-error triggers that lead to over-prediction of toxicity.

We discover notable anomalies, where words quite typical in neutral news reporting are confounded with incivility in the news domain. We also discover that the mention of many identities, such as Black, gay, Muslim, feminist, etc., triggers high incivility predictions. This occurs despite the fact that Perspective has been modified specifically to minimize such associations (Dixon et al., 2018a). Our findings echo results from gender debiasing of word representations, where bias is removed as measured by a fixed definition but remains present when probed differently (Gonen and Goldberg, 2019).
This common error, treating the mention of identity as evidence of incivility, is problematic when the goal is to analyze American political discourse, which is very much marked by us-vs-them identity framing of discussions.

These findings will serve as a basis for future work on debiasing systems for incivility prediction, while the dataset of incivility in American news will support computational work on this new task. Our work has implications for researchers of language technology and political science alike. For those developing automated methods for quantifying incivility, we pinpoint two aspects that require improvement in future work: detecting triggers of incivility over-prediction, and devising methods to mitigate the errors in prediction. We propose an approach for data-driven detection of error triggers; devising mitigation approaches remains an open problem. For those seeking to contrast civility in different sources, we provide compelling evidence that state-of-the-art automated tools are not appropriate for this task. The data and (in)civility ratings, available at https://github.com/anushreehede/incivility_in_news, would be of use to both groups as test data for future models of civility prediction.

Incivility detection is a well-established task, though it is not well standardized, with the degree and type of incivility varying across datasets. Hate speech, defined as speech that targets social groups with the intent to cause harm, is arguably the most widely studied form of incivility, largely due to the practical need to moderate online discussions. Many Twitter datasets have been collected: of racist and sexist tweets (Waseem and Hovy, 2016), of hateful and offensive tweets (Davidson et al., 2017), and of hateful, abusive, and spam tweets (Founta et al., 2018). Another category of incivility detection that more closely aligns with our work is toxicity prediction. Hua et al. (2018) collected a dataset for toxicity identification in online comments on Wikipedia talk pages, where toxicity is defined as comments that are rude, disrespectful, or otherwise likely to make someone leave a discussion. All these datasets are built for social media platforms, using either online comments or tweets. We work with American news. To verify whether Perspective can reproduce human judgments of civility in this domain, we collect a corpus of news segments annotated for civility.
Models trained on the datasets described above associate the presence of certain descriptors of people with incivility (Hua et al., 2018). This bias can be explained by the distribution of words in incivility datasets and by the fact that systems are not capable of using context to disambiguate between civil and uncivil uses of a word, instead associating the word with its dominant usage in the training data. To mitigate this bias, Jigsaw's Perspective API was updated, and model cards (Mitchell et al., 2019) were released to show how well the system predicts toxicity when certain identity words are mentioned in a text. Simple templates such as "I am <IDENTITY>" were used to measure the toxicity associated with identity words. More recently, many more incorrect associations with toxicity were discovered. Prabhakaran et al. (2019) found that Perspective returned higher toxicity scores when certain names are mentioned, and Hutchinson et al. (2020) found that this was also the case for words and phrases representing disability: "I am a blind person" had a significantly higher toxicity score than "I am a tall person". We show that when measured with different templates, the bias that was mitigated in Perspective still manifests. Further, we propose a way to establish a reference set of words and then find words associated with markedly higher toxicity than the reference. This approach reveals a larger set of words that do not lead to outright errors but trigger uncommonly elevated predictions of toxicity in the lower ranges of the toxicity scale. Waseem and Hovy (2016) found that the most common words in sexist and racist tweets in their corpus are "not, sexist, ..."

We study the following American programs: PBS NewsHour, MSNBC's The Rachel Maddow Show, and Hannity from FOX News. For brevity, we use the network to refer to each source (PBS, MSNBC, and FOX) in the following discussions. These sources are roughly representative of the political spectrum in American politics, with NewsHour being Left-Center, and MSNBC and FOX having strong left and right bias, respectively. These are generally one-hour shows, with about 45 minutes of content when commercial breaks are excluded. The transcripts, which provide speaker labels and turns, are from February 2019. We take only the days when all three shows aired and we had transcripts, for a total of 51 transcripts.

We analyze the programs on two levels with the help of research assistants paid $15 per hour. All of them are undergraduate students in non-technical majors at the University of Pennsylvania. In Pass I, we ask a single rater to read through a transcript and identify speaker turns that appear particularly uncivil or notably civil. We characterize the programs according to the number of uncivil segments identified in each transcript. After that, in Pass II, a research assistant chose a larger snippet of about 200 words that includes the selection identified as notably uncivil (or, respectively, civil), providing more context for the initially selected speaker turns. We also selected several snippets of similar length at random, not overlapping with the civil and uncivil snippets. (For this first analysis of incivility, we pre-selected uncivil segments to ensure that they are well represented in the data; incivility is rare, so a random sample of snippets would contain considerably fewer clearly uncivil ones. We are currently augmenting the dataset with data from later months, with segments for annotation drawn from randomly selected timestamps, giving a more representative sample of the typical content that appears in each show.)
Show      U    C    R   Overall   Avg Len   Vocab
FOX      81   12   23      116     201.0    3960
MSNBC    13   23    8       44     209.6    1763
PBS      11   31   17       59     216.4    2627
Overall 105   66   48      219     206.9    5962

Table 1: Number of Pass II snippets, separated by show and Pass I class: U(ncivil), C(ivil), and R(andom), along with the average length of the snippets in words and the size of the vocabulary (unique words).

Figure 1: Annotation interface.

Each of these snippets, 219 in total, was then rated for perceived civility by two raters, neither of whom participated in the initial selection of content. The raters did not know why a snippet that they were rating was selected. We choose to annotate such snippets, corresponding to about a minute of show content, to ensure sufficient context is available to make meaningful judgments about the overall tone of presentation. Table 1 gives an overview of the number of Uncivil, Civil, and Random speaker turns around which longer snippets were selected for fine-grained annotation of incivility. The largest number of civil turns was found in PBS, and the most uncivil turns were identified in FOX.

Snippets were presented for annotation in batches of 15. Each batch had a mix of snippets from each of the programs, displayed in random order. The annotators read each snippet and rated it on a 10-point scale on four dimensions used in prior work to assess civility in news (Mutz and Reeves, 2005): Polite/Rude, Friendly/Hostile, Cooperative/Quarrelsome, and Calm/Agitated. The dimensions appeared in random order and alternated which end of the dimension appeared on the left and which on the right, prompting raters to be more thoughtful in their selection and making it difficult to simply select the same value for all. A screenshot of the interface is shown in Figure 1.

The composite civility score is then obtained by first reversing the ratings to account for the alternation of the ends of each dimension, so that all ratings result in small values for the civil ends of the scales (Polite, Friendly, Cooperative, Calm) and high values for the uncivil ends. The four scores were averaged for each annotator. Finally, the scores of the two annotators are averaged to obtain a civility score for each snippet, ranging between 1 and 10, where 1 is the most civil and 10 is the most uncivil possible.
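To make the score construction concrete, here is a minimal sketch of the computation just described; the function and variable names are our own, and we assume the usual 11 - r flip for reversing a rating on a 1-10 scale:

```python
import numpy as np

# The four civility dimensions from Mutz and Reeves (2005), each rated
# on a 10-point scale; the interface randomized which end appeared first.
DIMENSIONS = ["Polite/Rude", "Friendly/Hostile",
              "Cooperative/Quarrelsome", "Calm/Agitated"]

def composite_civility(ratings_a, ratings_b, reversed_flags):
    """Composite civility score (1 = most civil, 10 = most uncivil)
    for one snippet rated by two annotators.

    ratings_a, ratings_b: dicts mapping dimension -> raw 1-10 rating.
    reversed_flags: dict mapping dimension -> True when the uncivil end
    of the scale was shown on the left, so the raw rating must be flipped.
    """
    def annotator_score(ratings):
        oriented = []
        for dim in DIMENSIONS:
            raw = ratings[dim]
            # Re-orient so that small values always mean civil.
            oriented.append(11 - raw if reversed_flags[dim] else raw)
        return np.mean(oriented)  # average over the four dimensions

    # Average the two annotators' per-dimension averages.
    return (annotator_score(ratings_a) + annotator_score(ratings_b)) / 2
```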
Here we briefly discuss the annotator agreement on perceived incivility. We characterize the agreement on the transcript level in terms of the civil and uncivil speaker turns flagged in Pass I, and on the text-snippet level in terms of correlations and absolute differences in the scores assigned by a pair of raters in Pass II.

The Pass I selection of turns was made by one person, but we are still able to glean some insight into the validity of their selection from analyzing the ratings of Pass II snippets. The Pass I selection can be construed as a justification for the score assigned in Pass II, similar to other tasks in which a rationale for a prediction has to be provided (DeYoung et al., 2020). Figure 2 shows the distribution of scores for the 200-word snippets that were selected around the initial 40-50-word speaker turns deemed notably civil, notably uncivil, or chosen at random. The distribution is as expected, with segments including uncivil turns almost uniformly rated as uncivil, with civility scores greater than 5. According to our scale, a score of 5 would correspond to borderline civil on all dimensions or highly uncivil on at least one. Only three of the 105 snippets selected around an uncivil turn got scores a bit under 5; the remaining snippets, including a rationale for incivility, were rated as uncivil with high consistency by the independent raters in Pass II.

In Pass II, each pair of annotators rated a total of 37 or 36 segments. Correlations between the composite ratings are overall high, ranging from 0.86 to 0.65 per pair. For only one of the six pairs is the correlation below 0.75. The absolute difference in composite incivility scores between the two independent annotators is about 1 point.

Overall, the consistency between the initial selection from the transcript and the independent ratings of civility, as well as the correlation between the civility ratings of two raters for the snippets, are very good.
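The pairwise agreement statistics above are straightforward to compute; below is a minimal sketch, where scores_a and scores_b are hypothetical arrays holding the composite scores a pair of annotators assigned to the same snippets:

```python
import numpy as np
from scipy.stats import pearsonr

def pair_agreement(scores_a, scores_b):
    """Agreement between two annotators who rated the same snippets:
    Pearson correlation of their composite scores, plus the mean
    absolute difference between the scores."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    r, _ = pearsonr(scores_a, scores_b)
    mad = float(np.mean(np.abs(scores_a - scores_b)))
    return r, mad
```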
Figure 2: Distribution of snippet civility ratings (obtained in Pass II) by rationale type (obtained in Pass I): Civil (C), Uncivil (U), and Random (R); human ratings on the x-axis, counts on the y-axis. The snippets deemed to contain uncivil turns during Pass I are consistently rated as highly uncivil in Pass II.
Figure 3: Scatterplot of human ratings (x-axis) and Perspective scores (y-axis) for snippets from FOX, MSNBC, and PBS.
With the corpus of real-valued civility in the news, we can now compare sources as assessed by the perception of people and as assessed by Perspective. We perform the analysis on the transcript level and the snippet level. On the transcript level, we use the number of uncivil utterances marked in Pass I as the indicator of incivility. For the snippet level, we use the average composite civility ratings from the raters in Pass II. For the automatic analysis, we obtain Perspective scores for each speaker turn as marked in the transcript and count the number of turns predicted as having toxicity 0.5 or greater. For snippets, we obtain Perspective scores directly, combining multiple speaker turns.

Figure 3 gives the scatter plot of civility scores per segment, from raters and from Perspective. The plot reveals that Perspective is not sensitive enough to detect any differences in levels of incivility for human ratings lower than six. For the higher levels of incivility, Perspective scores also increase and show better differentiation. However, the snippets from MSNBC rated as uncivil by people receive low scores. We verified that these segments are indeed uncivil, but in a sarcastic, indirect way. Portions of the turns can be seen in Table 2.

There is only one snippet with a Perspective toxicity score over 0.5 among the civil to borderline-civil segments from the news shows; this indicates good precision for binary identification of civil content. Perspective's performance in detecting binary incivility (for snippets with ratings greater than 5) is mixed. The recall for incivility is not that good, with some of these snippets receiving low toxicity scores. The trend of increasing Perspective scores with increasing human-rated incivility is observed mostly on segments from FOX. The incivility in FOX appears to be more similar to that seen in online forums, with name-calling and labeling of individuals. Some examples can be seen in Table 3. The more subtle, snarky comments from MSNBC are not detected as toxic by Perspective.

However, when characterizing shows by incivility, detecting individual utterances that may be problematic is not of much interest. The goal is to characterize the show (daily, weekly, monthly) overall. For this, we inspect the average incivility per show on the transcript and segment levels (see Table 4). At both granularities, people perceive a statistically significant difference in civility between each pair of shows, with FOX perceived as the most uncivil, followed by MSNBC, and PBS as the most civil.

On the transcript level, the presence of incivility in PBS is not statistically significant. The raters chose 0.29 (fewer than one) uncivil utterances per PBS show over the 17 days we study, compared with 4.5 per show for FOX. The 95% confidence interval for the mean number of uncivil utterances per show covers zero for PBS, so the incivility there is not statistically significant. The lower end of the 95% confidence interval for the mean transcript-level incivility in FOX and MSNBC is greater than zero, indicating consistent incivility in these programs.

The segment-level analysis of civility ratings reveals the same trend, with a one-point difference between PBS and MSNBC and two points between MSNBC and FOX. All of these differences are statistically significant at the 0.01 level, according to a two-sided Mann-Whitney test.

The automated analyses paint a different picture. On the transcript level, FOX is overall the most uncivil, with about 6 speaker turns per transcript predicted to be toxic, that is, with a Perspective score greater than 0.5. PBS appears to be next in level of incivility, with more than one toxic turn per transcript. For both of these shows, incivility is over-predicted, and many of the segments predicted as uncivil are civil according to the human ratings. MSNBC is predicted to have fewer than one toxic turn per transcript, under-detecting incivility. On the segment level, FOX is again assessed as the most uncivil, and PBS again appears to be more uncivil than MSNBC. On the segment level, the differences are statistically significant. Perspective incorrectly characterizes PBS as significantly more uncivil than MSNBC.

The correlation between human and Perspective incivility scores is 0.29 on the transcript level and 0.51 on the segment level. Overall, for broadcast news, Perspective cannot reproduce the incivility perception of people. In addition to the inability to detect sarcasm/snark, there seems to be a problem with over-prediction of the incivility in PBS and FOX. In the next section, we seek to establish some of the drivers of the over-prediction errors, characterizing individual words as possible triggers of absolute or relative over-prediction of incivility.
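For readers who want to reproduce the automated side of this comparison, below is a minimal sketch of scoring speaker turns with Perspective and deriving the transcript-level count of toxic turns. The request shape follows the public Perspective API documentation; the API key is a placeholder, and a production script would also need rate limiting and error handling.

```python
import requests

PERSPECTIVE_URL = ("https://commentanalyzer.googleapis.com/"
                   "v1alpha1/comments:analyze")
API_KEY = "YOUR_API_KEY"  # placeholder; obtain a key for the Perspective API

def toxicity_score(text):
    """Return Perspective's TOXICITY summary score (0-1) for a text."""
    body = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": API_KEY}, json=body)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def toxic_turn_count(speaker_turns, threshold=0.5):
    """Transcript-level incivility proxy used above: the number of
    speaker turns whose predicted toxicity is 0.5 or greater."""
    return sum(toxicity_score(turn) >= threshold for turn in speaker_turns)
```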
Prior work has drawn attention to the fact that certain words describing people are incorrectly interpreted as triggers of incivility by Perspective, leading to errors in which a perfectly acceptable text segment containing the words would be predicted as toxic (Dixon et al., 2018b; Hutchinson et al., 2020). Their analysis, similar to other work on bias in word representations, starts with a small list of about 50 words to be analyzed in an attempt to find toxicity over-prediction triggers.

In our work, we seek to apply the same reasoning with the same goals, but in a more data-driven manner, without having to commit to a very small list of words for analysis. Given our text domain of interest (news) and the desideratum to characterize sources rather than individual text segments, we also find sub-errors, i.e., words that do not lead to errors in toxicity prediction but have much higher than average toxicity associated with them compared to other words.

Ideally, we would like to test the full vocabulary of a new domain for (sub-)error triggers, but methods for doing so do not exist and may not be practical when the vocabulary is too large. For this reason, we sample words in a way informed by the target application, choosing words that contribute the most to the average incivility score of one of the news sources compared to another. We sample a thousand words from each show. Occasionally, the same word is sampled for more than one show, so the final list for detailed analysis consists of 2,671 words.

Snippets from MSNBC:

"... They say that he is also going to declare the national emergency, but with this presidency, honestly, is that actually going to happen? We don't know. The White House says it's going to happen, but in this presidency, is anything really done and dusted before Fox & Friends says it's done and dusted? ..."

"Well, he better take a few moments, I know he doesn't read, but perhaps someone could read Article I Section 7 and 8 of the Constitution. The power of appropriation lies with the Congress, not with the president. If he were trying to do this, he is basically establishing an imperial president, his majesty."

"... proposing fireworks for the Fourth of July. Even in D.C., it's a bold idea from the president today. Presumably this will be followed by an executive order proclaiming that from here on out, we're going to start a whole new calendar year every year on the first day of January. What? Also, he's going to declare that we're going to start a new American pastime, a ball game where one person holds a stick and that person runs around a series of bases to come home if the person with the stick hits the ball well enough and hard enough, a lot of people try to catch it and prevent the batter from rounding the bases and get home. The president will soon announce a name for that and announce that he has invented this game. Also, he's invented rap music and the idea of taking a vacation in the summer if you're a school kid. I mean, I kid you not, the president proposed to reporters from the White House in all seriousness that he's thinking there should maybe be fireworks and a parade on the Fourth of July in Washington, D.C. It could catch on. It could become a tradition."

Table 2: Snippets from MSNBC rated as highly uncivil by humans but with low toxicity scores from Perspective. Human ratings are between 1 and 10; Perspective scores are between 0 and 1.
Snippets from FOX:

"Meanwhile, this garbage deal that I've been telling you about, out of Washington, which allocates a measly additional $1.375 billion for wall construction – well, that has not been signed by the president. As a matter of fact, nobody in the White House has seen any language in the bill, none. So, let's be frank. It's not nearly enough money. Washington lawmakers, the swamp creatures, they have once again let we, the people, the American people down, this is a swamp compromise. Now, let me make predictions. Based on what the president is saying publicly, I love the press love to speculate, how did Hannity get his information? All right. We know this is a president that keeps his promises. And he goes at the speed of Trump. And if you watch every speech, if you actually listen to Donald Trump's words, you know, especially you people in the hate Trump media, you psychos, he telegraphs. He's saying exactly what he's planning to do. You're too stupid to listen."

"The Democratic party with no plan to make your life better. More safe, more secure, more prosperous for not only you but for your children and your grandchildren. They are fueled too by a rage, hate and obsession with victimhood and identity politics, entitlement, pushing doom and gloom day and night, climate hysteria, indoctrinating your kids into believing the world is ending in 12 years. Just to push the socialist agenda. A party fueled by what can only be described as psychotic, literally unhinged, rage and hatred every second, minute and hour of the day against the duly elected president. They are not even trying to hide it. Just today we learned that shifty Adam Schiff, former MSNBC legal analyst, collusion conspiracy theorist Daniel Goldman to leave his new witch hunt against the President."

Table 3: Examples of snippets from FOX, with the associated human incivility ratings and Perspective scores.
                  Human                 Perspective
Show    Count   Avg    95% CI          Avg    95% CI

Transcript level
FOX      17     4.53   [2.62, 6.44]    6.18   [3.86, 8.49]
MSNBC    17     1.24   [0.40, 2.08]    0.29   [-0.01, 0.60]
PBS      17     0.29   [-0.06, 0.65]   1.41   [0.60, 2.23]

Snippet level
FOX     116     7.09   [6.85, 7.33]    0.33   [0.30, 0.37]
MSNBC    44     4.97   [4.43, 6.00]    0.15   [0.12, 0.17]
PBS      59     3.87   [3.56, 4.18]    0.17   [0.14, 0.19]

Table 4: Statistics for human incivility ratings and Perspective scores. Count is the number of transcripts and the number of snippets at each level of analysis.
Then we proceed similarly to prior work, filling short templates in which only one word differs: a fixed slot is filled in with each word in turn. We use five templates, chosen in an ad-hoc way similar to prior work. We do not reuse templates from prior work because those are tailored to the analysis of identity words, while we would like any word from the vocabulary to be an acceptable completion of the template. The average toxicity of the templates filled in with a given word then provides us with a word-specific incivility score by which we can compare the full set of selected words.

The first two templates were chosen to be neutral; the third contains a word with a possibly negative connotation but used in a neutral sense; the fourth contains negation; and the last is a question, to account for a wider variety of possible contexts in which incivility triggers may occur. The five templates we use are:

(i) We wrote WORD on the page.
(ii) They whispered WORD to themselves.
(iii) I erased the word WORD from the page.
(iv) You didn't write WORD.
(v) Did he say WORD?

We use Perspective to predict the toxicity of each template when WORD is substituted with each of the words in our list for analysis. The average and maximum of these toxicities serve as a useful characterization of the words. The templates were chosen so that the full sentence is not offensive; that is, none of the sentences, even when filled with an offensive word, should be interpreted as deliberately rude or uncivil without additional context. Higher values of predicted incivility would implicate the word substituted in the placeholder as the perceived trigger of incivility. We wish to find words where this association is wrong.
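A sketch of this word-scoring procedure, with the five templates taken verbatim from above; toxicity_score is assumed to be a Perspective wrapper such as the one sketched earlier:

```python
TEMPLATES = [
    "We wrote {w} on the page.",
    "They whispered {w} to themselves.",
    "I erased the word {w} from the page.",
    "You didn't write {w}.",
    "Did he say {w}?",
]

def word_profile(word, toxicity_score):
    """Per-word incivility profile: Perspective toxicity of each template
    with the word substituted in, plus the average and the maximum."""
    scores = [toxicity_score(t.format(w=word)) for t in TEMPLATES]
    return {"scores": scores,
            "avg": sum(scores) / len(scores),
            "max": max(scores)}
```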
At the same time, without reference to or knowledge of the template word-scores, we classified the words as offensive or not. Two of the authors classified the words independently and then adjudicated the disagreements in a conversation. Ultimately, there were 65 words we judged as offensive out of the full list of 2,671 words.

Separating offensive words from the rest is helpful. Using these words, we can get a sense of the degree to which Perspective incorporates context to make its predictions, and we can then exclude them from further analysis. When templates are filled in with offensive words, the templates present hard cases for prediction. For these cases, context interpretation is necessary to make the correct prediction; word-spotting without incorporating context is likely to lead to an error. The ability to do this is important for our application: in the news, it is acceptable to report on someone else's incivility and their uncivil statements. If this is done with the purpose of expressing disapproval of the incivility, the reporting itself is not uncivil.

Furthermore, the set of offensive words allows us to quantify the degree to which the template scores justify our choice of rules to classify non-offensive words as error and sub-error triggers. We will consider words to be error triggers if at least one template was judged by Perspective to have toxicity greater than 0.5. Sub-error triggers are words for which all five templates had toxicity lower than 0.5, but whose average toxicity was markedly higher than that of other words.

The average template toxicity for offensive words is 0.48, compared to 0.11 for the 2,606 non-offensive words in the list we analyzed. Of the 65 offensive words, 54% had at least one template with toxicity greater than 0.5. Perspective clearly is not simply word-spotting to make its predictions: it produces toxicity scores below 0.5 for about half of the offensive words. For the other half, however, it often produces an error.

In addition, 35% of the offensive words met the criteria for sub-error triggers. Overall, 89% of the offensive words meet either the error-trigger or sub-error-trigger criteria, confirming that these ranges of toxicity are the appropriate ones in which to focus our attention when searching for words that may have an unusually high association with toxicity. Example offensive words that met the sub-error criteria are: bozo, cheater, Crazy, crock, deplorables, F-ing, hoax, insane, mad, etc.

Other words, which we deemed non-offensive, have a template profile similar to that of the vast majority of offensive words. They are the ones for which Perspective over-predicts toxicity.
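Expressed in code, the two criteria read roughly as follows; this is a sketch over the per-template scores produced by word_profile above, with default reference statistics taken from the reference word sample quantified later in the text:

```python
def classify_trigger(scores, ref_mean=0.09, ref_sd=0.01):
    """Classify a non-offensive word by its five template toxicities.

    Error trigger: at least one template scored 0.5 or greater.
    Sub-error trigger: all templates below 0.5, but the average exceeds
    the reference words' mean by more than two standard deviations
    (defaults from the reference sample reported later in the paper).
    """
    avg = sum(scores) / len(scores)
    if max(scores) >= 0.5:
        return "error"
    if avg > ref_mean + 2 * ref_sd:
        return "sub-error"
    return "none"
```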
We consider a word to be an error trigger if at least one of the templates receives a toxicity score of 0.5 or greater from Perspective. Below is the full list of error-trigger words from the news shows. We informally organize them into categories to facilitate inspection.
Identity: African-Americans, anti-Semites, anti-white, anti-women, BlacKkKlansman, feminism, Feminist, gay, Homophobia, Homophobic, homophobic, Islam, Islamic, Jew, LGBT, Muslim, women

Violence and Sex: annihilation, assassinated, beheaded, die, kill, killed, killing, murdered, shooting; intercourse, pedophiles, pornography, prostitution, rape, raped, rapist, rapists, sex, sexist, Sexual, sexual, sexually, sodomized

Ambiguity: Dick, dirty, garbage, rats

Informal: dopest, farting
Other: brain, counter-intelligence, blackface, hypocrisy

Many error triggers are identity words describing people. Many of these identities, like gay and Muslim, were known triggers of incorrect toxicity prediction, and Perspective was specially altered to address this problem (Dixon et al., 2018a). Clearly, however, they are still a source of error, and the approaches for bias mitigation have not been fully successful. As we mentioned in the introduction, a system that is unstable in its predictions when identity words appear in a text is not suitable for the analysis of political discourse, where identities are mentioned often and where, in many cases, it is of interest to quantify the civility with which people talk about different identity groups.

The second large class of error triggers consists of words related to death and sex. In online comments, toxic statements are often threats of sexual violence and death. Perhaps this is why words that are broadly semantically related are associated with toxicity. These words, however, can appear in news contexts without any incivility. Reports of violence at home and abroad are a mainstay of news, as are reports of accusations of sexual misconduct and abuse. A system that is not capable of distinguishing the context of usage is not going to provide reliable quantification of incivility in the news.

Similarly, the ambiguous error triggers are words with more than one meaning, which could be offensive but can also be used in civil discussion. For these, Perspective has to disambiguate based on the context. All of the templates we used were neutral, so for these words, Perspective makes an error. For example, the name 'Dick' is an error trigger. The word indeed has an offensive meaning, but in the news shows we analyze, all mentions are in the sense of a person's name.

A couple of error triggers are informal, clashing more in register than conveying incivility.

115-pound, abortion, abuse, abused, abusively, abysmal, accosted, adult, Africa, age-old, aliens, America, Americans, anti-semite, Anti-Trump, anti-Trump, Aryan, assaulted, assaulting, attack, baby, bad, barrier, Basically, beaten, belly, Black, black, Blow, blowing, bomber, bottom, bouncer, British, Brooks, brown, bunch, caliphate, capacity, cares, Catholic, Catholics, chicken, chief, child, children, China, Chinese, chock-full, Christian, church, Clinton, conforming, Content, COOPER, country, Covington, cows, crackers, crawling, creatures, cries, crime, crimes, criminal, CROSSTALK, cruelty, crying, DANIELS, dare, daughter, death, decrepit, defy, dehumanizing, Democrat, Democrats, demonize, denigrate, destroy-, died, Dingell, Dinkins, disrespectful, Donald, doomed, drug, drugs, Drunk, ducking, Duke, dumping, eggs, epidemic, European, evil, exist, explode, exploit, extremist-related, face, fake, Fallon, firearm, fithy, folks, Foreign, Former, FRANCIS, fraud, fry, gag, gagged, gang, gender, girls, governor, gun, guy, guy's, guys, handgun, harassed, harboring, hate, hate-, hate-Trump, hatred, head, heads, heartless, Hebrew, Hegel, her, herein, heroin, Hillary, HIV, horrors, hts, human, hush, Hymie, ifs, illness, imperialists, impugning, inaudible, inconclusive, infanticide, infuriating, inhumane, intelligence, interracial, invaders, Iran, Iraqi, ISIS, Islamophobic, Israel, Israelites, jail, Juanita, Kaine, Karine, kid, kids, Klan, Korea, laid, LAUGHTER, lie, lied, lies, life-and-death, life-death, limbs, litig, lying, MADDOW, MAGA-inspired, males, mama, man, men, mental, military, minors, missile, mock, mockery, molested, mouth, muck, N-, n't, NAACP, nation-state, needless, newscasters, Nonproliferation, nose, nuke, Obama, Obama's, obscene, obsessions, obsessive-compulsive, obsolete, organ, outrageous, ovations, Oversight, oxygen, p.m., painful, Pakistan, patriarchal, people, person, police, pope, President, president, president's, pretty, priest, priests, prison, prog, punched, punches, Putin, Putin's, queer, racial, Racism, racism, Rage, ranted, relations, religion, religious, relitigating, remove, REP., Republican, Republicans, RUSH, Russian, Russians, S, Saudi, savagely, self-confessed, self-defining, self-proclaimed, semitic, she, SHOW, sick, slavery, sleet, slurs, smear, socialist, son, Spartacus, stick, stop, stunning, supporters, supremacist, swamp, Syria, tampering, terror, terrorism, terrorists, thrash, threat, throat, tirade, toddler, TRANSCRIPT, trashing, treasonous, Trump, tumor, U.K., U.S, U.S., U.S.-backed, undress, unsuccessful, unvetted, upstate, vandalized, Vatican, Venezuela, Venezuelans, videotaped, videotaping, violated, violence, violent, VIRGINIA, virulent, voters, War, war, weird, welcome, Whitaker, White, white, WITH, woman, worse, worth, xenophobic, Yemen, you, your, yourself

Table 5: Sub-error trigger words. The list comprises many identity words, words with negative connotations, words describing controversial topics, and words related to violence.
Sub-error triggers of incivility are words that are not offensive and for which Perspective returns a toxicity score below 0.5 for each of the five templates when the word is plugged in. The error triggers discussed above lead to actual errors by Perspective for its intended use. The sub-error triggers are within the acceptable level of noise for the original purpose of Perspective, but may introduce noise when the goal is to quantify the typical overall civility of a source.

To establish a reference point for the expected average template civility of non-offensive words, we sample 742 words that appeared in at least 10 speaker turns (i.e., were fairly common) and appeared in speaker turns in the news shows that received low toxicity predictions from Perspective, below 0.15. These were part of the list we classified as offensive or not; no word in this reference sample was labeled as offensive.

The average template score for the reference sample is 0.09, with a standard deviation of 0.01. We define sub-error triggers to be those words whose average template toxicity score is more than two standard deviations higher than the average in the reference list. There are 325 non-offensive words that meet the criteria for sub-error triggers. They are shown in Table 5. The list is long, which is disconcerting because sentences containing several of these words are likely to end up triggering errors.

A sizeable percentage of sub-error triggers are related to identities of people: gender, age, religion, country of origin. Oddly, a number of child-describing words appear in the list (baby, child, kid, toddler). There are also several personal names in the list, which indicates spurious correlations learned by Perspective; names by themselves cannot be uncivil.

Second-person pronouns (you, your) and third-person feminine pronouns (her, she) are sub-error triggers. The second-person pronouns are likely spuriously associated with toxicity due to their over-representation in toxic comments directed at other participants in a conversation. Similarly, the association of feminine pronouns with toxicity is likely due to the fact that a large fraction of indirect toxic comments online are targeted at women. Regardless of the reasons why these words were associated with incivility by Perspective, the vast majority of them are typical words mentioned in the news and in political discussions, explaining to an extent why Perspective is not able to provide a meaningful characterization of civility in the news.

"The other thing that's important, both sides seem to be mad about this. On the conservative side, you have conservative voices who are saying, this deal is pathetic, it's an insult to the president. On the Democratic side and on the liberal side, I have activists that are texting me saying, Nancy Pelosi said she wasn't going to give $1 for this wall, and now she is."

"I do feel safe at school. And I know that sounds kind of ridiculous, since tomorrow makes a year since there was a shooting at my school. But I do feel safe at school. And I feel safe sending my children to school. I know that there are recommendations that have been made to arm teachers, and I think that is the stupidest thing that I have ever heard in my life. Me having a gun in my classroom wouldn't have helped me that day."

"VIRGINIA ROBERTS, Victim: It ended with sexual abuse and intercourse, and then a pat on the back, you have done a really good job, like, you know, thank you very much, and here's $200. You know, before you know it, I'm being lent out to politicians and to academics and to people that – royalty and people that you just – you would never think, like, how did you get into that position of power in the first place, if you're this disgusting, evil, decrepit person on the inside?"

"In a major incident like this, at first there's, you know, there's sort of a stunned numb thing that happens. And then you kind of go into this honeymoon phase. There's just a high level of gratefulness for all of the help that's coming. And then you get to the phase that we're kind of beginning to dip in now, which is life sucks right now, and I don't know how long it's going to suck."

"It's insane. We are throwing away tremendous amounts of food every day. And there are people next door a block away that aren't getting enough food."

Table 6: Segments from PBS that contain offensive words (e.g., pathetic, ridiculous, stupidest, disgusting, suck, insane).
Category     FOX          MSNBC       PBS          Total
Error        197 [.44]    55 [.12]    196 [.44]     448
Sub-error   1537 [.39]   723 [.18]   1708 [.43]    3968
Offensive    277 [.52]   101 [.19]    156 [.29]     534

Table 7: Number [and fraction] of segments containing at least one (sub-)error trigger or offensive word.

Table 7 shows the distribution of error triggers, sub-error triggers, and offensive words in the three programs. Most of the segments containing error and sub-error triggers are in FOX and PBS; this could explain why the incivility gap between FOX and the other two programs is much larger when analyzed with Perspective than in human judgments. It also explains why PBS, judged significantly more civil than MSNBC by people, appears somewhat less civil to Perspective: not only is Perspective unable to detect some of the sarcastic incivility present in MSNBC, but PBS segments also include substantially more (sub-)error triggers than MSNBC.

More segments from PBS contain uncivil words, compared to MSNBC. Table 6 shows some representative PBS segments with offensive words. They are often instances of reported incivility, or occur in segments where a guest on the program uses such language. Most of these segments are not overall uncivil.
The work we presented was motivated by the desire to apply off-the-shelf methods for toxicity prediction to analyze civility in American news. These methods were developed to detect rude, disrespectful, or unreasonable comments that are likely to make someone leave a discussion in an online forum. To validate the use of Perspective for quantifying incivility in the news, we created a new corpus of perceived incivility in the news. On this corpus, we compare human ratings and Perspective predictions. We find that Perspective is not appropriate for such an application, providing misleading conclusions for sources that are mostly civil but between which people perceive a significant overall difference, for example, because one uses sarcasm to express incivility. Perspective is able to detect the less subtle differences in levels of incivility, but in a large-scale analysis that relies on Perspective exclusively, it would be impossible to know which differences reflect human perception and which do not.

We find that Perspective's inability to differentiate levels of incivility is partly due to the spurious correlations it has formed between certain non-offensive words and incivility. Many of these words are identity-related. Our work will facilitate future research on debiasing automated predictions. Such methods start off with a list of words that the system has to unlearn as associated with a given outcome. In prior work, the lists of words to debias came from informal experimentation with predictions from Perspective. Our work provides a mechanism to create a data-driven list that requires only limited human intervention, and it can discover broader classes of bias than people performing ad-hoc experiments can come up with.

A considerable portion of content marked as uncivil by people is not detected as unusual by Perspective. Sarcasm and a high-brow register in the delivery of uncivil language are at play here, and detecting them will require the development of new systems. Computational social scientists are well advised not to use Perspective for studies of incivility in political discourse, because it has clear deficiencies for such an application.
Acknowledgments
We thank the reviewers for their constructively critical comments. This work was partially funded by the National Institute for Civil Discourse.

References
Pew Research Center. 2019. Public highly critical of state of political discourse in the U.S. Technical report.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009.

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, Online. Association for Computational Linguistics.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018a. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (AIES 2018), New Orleans, LA, USA, pages 67–73.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018b. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73.

Antigoni-Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. arXiv preprint arXiv:1802.00393.

Jeremy A. Frimer and Linda J. Skitka. 2018. The Montagu principle: Incivility decreases politicians' public approval, even with their political base. Journal of Personality and Social Psychology, 115(5):845.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Yiqing Hua, Cristian Danescu-Niculescu-Mizil, Dario Taraborelli, Nithum Thain, Jeffery Sorensen, and Lucas Dixon. 2018. WikiConv: A corpus of the complete conversational history of a large online collaborative community. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2818–2823, Brussels, Belgium. Association for Computational Linguistics.

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online. Association for Computational Linguistics.

Kokil Jaidka, Alvin Zhou, and Yphtach Lelkes. 2019. Brevity is the soul of Twitter: The constraint affordance and political discussion. Journal of Communication, 69(4):345–372.

Kimberly Meltzer. 2015. Journalistic concern about uncivil political talk in digital news media: Responsibility, credibility, and academic influence. The International Journal of Press/Politics, 20(1):85–107.

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229.

Diana C. Mutz and Byron Reeves. 2005. The new videomalaise: Effects of televised incivility on political trust. American Political Science Review, pages 1–15.

Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5740–5745, Hong Kong, China. Association for Computational Linguistics.

Ian Rowe. 2015. Civility 2.0: A comparative analysis of incivility in online political discussion. Information, Communication & Society, 18(2):121–138.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Yannis Theocharis, Pablo Barberá, Zoltán Fazekas, and Sebastian Adrian Popa. 2020. The dynamics of political incivility on Twitter. Sage Open, 10(2):2158244020919447.

Eric M. Uslaner. 2000. Is the Senate more civil than the House? In Esteemed Colleagues: Civility and Deliberation in the US Senate, pages 32–55.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.

Mengzhou Xia, Anjalie Field, and Yulia Tsvetkov. 2020. Demoting racial bias in hate speech detection. In Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media, pages 7–14, Online. Association for Computational Linguistics.

Marc Ziegele, Mathias Weber, Oliver Quiring, and Timo Breiner. 2018. The dynamics of online news discussions: Effects of news articles and reader comments on users' involvement, willingness to participate, and the civility of their contributions.