Political Bias and Factualness in News Sharing across more than 100,000 Online Communities
Galen Weld, Maria Glenski, Tim Althoff
Paul G. Allen School of Computer Science and Engineering; Pacific Northwest National Laboratory

Abstract
As civil discourse increasingly takes place online, misinformation and the polarization of news shared in online communities have become ever more relevant concerns with real world harms across our society. Studying online news sharing at scale is challenging due to the massive volume of content which is shared by millions of users across thousands of communities. Therefore, existing research has largely focused on specific communities or specific interventions, such as bans. However, understanding the prevalence and spread of misinformation and polarization more broadly, across thousands of online communities, is critical for the development of governance strategies, interventions, and community design. Here, we conduct the largest study of news sharing on reddit to date, analyzing more than 550 million links spanning 4 years. We use non-partisan news source ratings from Media Bias/Fact Check to annotate links to news sources with their political bias and factualness. We find that, compared to left-leaning communities, right-leaning communities have 105% more variance in the political bias of their news sources, and more links to relatively-more biased sources, on average. We observe that reddit users' voting and re-sharing behaviors generally decrease the visibility of extremely biased and low factual content, which receives 20% fewer upvotes and 30% fewer exposures from crossposts than more neutral or more factual content. This suggests that reddit is more resilient to low factual content than Twitter. We show that extremely biased and low factual content is very concentrated, with 99% of such content being shared in only 0.5% of communities, giving credence to the recent strategy of community-wide bans and quarantines.
Biased and inaccurate news shared online are major concerns that have risen to the forefront of public discourse regarding social media in recent years. Two thirds of Americans get at least some of their news content from social media, but less than half expect this content to be accurate (Shearer and Matsa 2018). Globally, only 22% of survey respondents trust the news in social media "most of the time" (Newman 2020). Internet platforms such as Twitter, Facebook, and reddit account for an ever-increasing share of the dissemination and discussion of news (Geiger 2019).

Harms caused by biased and false news have substantial impact across our society. Polarized content on Twitter and Facebook has been shown to play a role in the outcome of elections (Recuero, Soares, and Gruzd 2020; Kharratzadeh and Üstebay 2017); and misinformation related to COVID-19 has been found to have a negative impact on public health responses to the pandemic (Tasnim, Hossain, and Mazumder 2020; Kouzy et al. 2020). Developing methods for reducing these harms requires a broad understanding of the political bias and factualness of news content shared online, but studying news sharing is challenging for three reasons: (1) the scale is immense, with billions of news links shared annually, (2) it is difficult to automatically quantify bias and factualness at scales where human labeling is often infeasible (Rajadesingan, Resnick, and Budak 2020), and (3) the distribution of links is complex, with these links shared by many millions of users and thousands of communities.

While previous research has led to important insights on specific aspects of news sharing, such as user engagement (Risch and Krestel 2020), fact checking (Vosoughi, Roy, and Aral 2018; Choi et al. 2020), specific communities (Rajadesingan, Resnick, and Budak 2020), and specific rumors (Vosoughi, Mohsenvand, and Roy 2017; Qazvinian et al. 2011), large scale studies of news sharing are critical to understanding polarization and misinformation more broadly, and can inform community design, governance, and moderation interventions.

In this work, we present the largest study to date of news sharing behavior on reddit, one of the most popular social media websites. We analyze all 559 million links submitted to reddit from 2015 to 2019 (August 2019 was the most recent month of data available at the time of this study), including 35 million news links submitted by 1.3 million users to 135 thousand communities. We rate the bias and factualness of linked-to news sources using Media Bias/Fact Check (MBFC), which considers how news sources favor different sides of the left-right political spectrum (bias), and the veracity of claims made in specific news stories (factualness) (§3). While bias and factualness may vary from story to story, news source-level ratings maximize the number of links that can be rated, and are commonly used in research (Bozarth, Saraf, and Budak 2020).

In our analyses, we examine: the diversity of news within communities (§4), and how this diversity is composed of both the differences between community members and individual members' diversity of submissions; the impact of current curation and amplification behaviors on news' visibility and spread (§5); and the concentration of extremely biased and low factual content (§6), examining the distribution of links from the perspectives of who submitted them and what community they were submitted to.

We show that communities on reddit exist across the left-right political spectrum, as measured by MBFC, but 74% are ideologically center left.
We find that the diversity of left-leaning communities' membership is similar to that of equivalently right-leaning communities, but right-leaning communities have 105% more politically varied news sources, as their members individually post more varied links. This variance comes from the presence of links that are different from the community average, and in right-leaning communities, 74% of such links are to relatively-more biased news sources, 35% more than in left-leaning communities (§4).

We demonstrate that, regardless of the political leaning of the community, community members' voting and crossposting (re-sharing) behavior reduces the impact of extremely biased and low factual news sources. Links to these news sources receive 20% fewer upvotes (§5.2) and 30% fewer exposures from crossposts compared to more neutral and higher factual content (§5.3). Furthermore, we find that users who submit such content leave reddit 68% more quickly than others. These findings suggest that low factual content spreads more slowly and is amplified less on reddit than has been reported for Twitter (Vosoughi, Roy, and Aral 2018), which may stem from reddit's explicit division into communities, or users' ability to downvote content, both of which help control content exposure.

Extremely biased and low factual content can be challenging to manage, as it is spread through many users, news sources, and communities. We find that extremely biased and low factual content is spread by an even broader set of users and communities relative to news content as a whole, exacerbating this challenge (§6). However, we find that 99% of extremely biased or low factual content is still concentrated in 0.5% of communities, lending credence to recent interventions at the community level (Chandrasekharan et al. 2017, 2020; Saleem and Ruths 2018; Ribeiro et al. 2020).

Our work demonstrates that additional research on news sharing online is especially needed on the topics of why users depart platforms and where they go, why false news appears to spread more quickly on Twitter than on reddit, and how curation and amplification practices can manage influxes of extremely biased and low factual content. Finally, we make all of our data and analyses publicly available to encourage future work on this important topic.

Misinformation and Deceptive News.
Social news platforms have seen continued increase in use and a simultaneous increase in concern regarding biased news and misinformation (Mitchell et al. 2019; Marwick and Lewis 2017). Recent studies have used network spread (Vosoughi, Roy, and Aral 2018; Ferrara 2017), content consumer (Allen et al. 2020), and content producer (Linvill and Warren 2020) approaches to assess the spread of misinformation. In this work, we examine news sharing behavior from news sources who publish content with varied degrees of bias or factualness, building on related work that has analyzed social news based on the characteristics of a news source's audience (Samory, Abnousi, and Mitra 2020a) or the type of content posted (Glenski, Weninger, and Volkova 2018).

https://behavioral-data.github.io/news labeling reddit/

Polarization and Political Bias.
Many papers have recently been published on detecting political bias of online content either automatically (Baly et al. 2020; Demszky et al. 2019) or manually (Ganguly et al. 2020; Bozarth, Saraf, and Budak 2020). Others have examined bias in moderation of content, as opposed to biased content or news sources themselves (Jiang, Robertson, and Wilson 2019, 2020). Echo chambers are a major consideration in understanding polarization, with papers focusing on their development (Allison and Bussey 2020) and the role of news sources in echo chambers (Horne, Nørregaard, and Adali 2019). Others have examined who shares what content with what political bias, but did so using implicit community structure (Samory, Abnousi, and Mitra 2020b). In this work, we examine thousands of explicit communities on reddit, characterizing their polarization by examining the political diversity of news sources shared within, and the diversity of the community members who contribute.
Moderation and Governance.
A large body of work has examined the role of moderation interventions such as explanations (Jhaver, Bruckman, and Gilbert 2019), content removal (Chandrasekharan et al. 2018), and community bans (Chandrasekharan et al. 2017, 2020; Saleem and Ruths 2018) on outcomes such as migration (Ribeiro et al. 2020), harassment (Matias 2019a), and harmful language use (Wadden et al. 2021). Others have focused on moderators themselves (Matias 2019b; Dosono and Semaan 2019), and technological tools to assist them (Jhaver et al. 2019; Zhang, Hugh, and Bernstein 2020; Chandrasekharan et al. 2019), as well as self-moderation through voting (Glenski, Pennycuff, and Weninger 2017; Risch and Krestel 2020) and community norms (Fiesler et al. 2018). In contrast, our work informs the viability of different moderation strategies, specifically by examining the sharing and visibility of news content across thousands of communities.
We analyze all reddit submissions to extract links, and annotate links to news sources with their political bias and factualness using ratings from Media Bias/Fact Check.

reddit is the sixth most visited website in the world, and is widely studied due to its size, diversity of communities, and the public availability of its content (Medvedev, Lambiotte, and Delvenne 2018). Users can submit links or text (known as "selfposts") to specific communities, known as "subreddits." Users may view submissions for a single community, or create a "front page" which aggregates submissions from many communities. We analyze all submissions from January 2015 through August 2019, inclusive, for a total of 56 months of content (580 million submissions, 35 million unique authors, 3.4 million unique subreddits). For each submission, we extract the URLs of each linked-to website, which resulted in 559 million links.

Figure 1: The percentage of links that can be annotated using the MBFC labels is very consistent over the period of study.

Ethical Considerations.
We value and respect the privacy and agency of all people potentially impacted by this work. All reddit content analyzed in this study is publicly accessible, and Pushshift, from which we source our reddit content, permits any user to request removal of their submissions at any time. We take specific steps to protect the privacy of people included in our study (Fiesler and Proferes 2018). We do not identify specific users, and we exclusively analyze data and report our results in aggregate. All analysis of data in this study was conducted in accordance with the Institutional Review Board at [institution anonymized].

To identify and annotate links to news sources, we make use of Media Bias/Fact Check (hereafter MBFC), an independently run news source rating service. Bozarth, Saraf, and Budak (2020) find that "the choice of traditional news lists [for fact checking] seems to not matter" when comparing 5 different news lists including MBFC. Therefore, we selected MBFC as it offers the largest set of labels of any news source rating service (Bozarth, Saraf, and Budak 2020). MBFC provides ratings of the political bias (left to right) and factualness (low to high) of news outlets around the world, along with additional details and justifications for ratings, using a rigorous public methodology (mediabiasfactcheck.com/methodology/). MBFC is widely used for labelling the bias and factualness of news sources for downstream analysis (Heydari et al. 2019; Main 2018; Starbird 2017; Darwish, Magdy, and Zanouda 2017; Nelimarkka, Laaksonen, and Semaan 2018) and as ground truth for prediction tasks (Dinkov et al. 2019; Stefanov et al. 2020).

From MBFC's public reports on each news source, we extract the name of the news source, its website, and the political bias and factualness ratings. (While link submissions by definition contain exactly one link, text submissions, i.e. selfposts, can include 0 or more links.)
Bias is measured on a 7-point scale of 'extreme left,' 'left,' 'center left,' 'center,' 'center right,' 'right,' and 'extreme right,' and is reported for 2,440 news sources. Factualness is measured on a 6-point scale of 'very low factual,' 'low factual,' 'mixed factual,' 'mostly factual,' 'high factual,' and 'very high factual,' and is reported for 2,676 news sources (as of April 2020). For brevity, in the following analyses, we occasionally use the term 'left leaning' to indicate a news source with a bias rating of 'extreme left,' 'left,' or 'center left,' and the term 'right leaning' to indicate a news source with a bias rating of 'center right,' 'right,' or 'extreme right.'

We then annotate the links extracted from reddit submissions with the MBFC ratings using regular expressions to match the URL of the link with the domain of the corresponding news source. For example, a link to an article on rt.com would be matched with the rt.com domain of RT, the Russian-funded television network, and annotated with a bias of 'center right' and a factualness of 'very low.' We find that links to center left and high factual news sources are most common, accounting for 53% and 64% of all news links, respectively. Extreme left news source links are much less common, with 22.2 extreme right links for every 1 extreme left link (Fig. 2).
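As an illustration of this matching step, here is a minimal sketch in Python. The `MBFC_RATINGS` dictionary is a hypothetical two-entry stand-in for the full MBFC list, and the subdomain-stripping loop is our simplification of the paper's regular-expression matching:

```python
from urllib.parse import urlparse

# Hypothetical subset of MBFC ratings, keyed by news source domain.
MBFC_RATINGS = {
    "rt.com": {"bias": "center right", "factualness": "very low factual"},
    "apnews.com": {"bias": "center", "factualness": "very high factual"},
}

def annotate_link(url):
    """Return the MBFC rating for a link, or None if the domain is unrated."""
    domain = urlparse(url).netloc.lower()
    # Strip subdomains one at a time so "www.rt.com" or "m.rt.com" matches "rt.com".
    parts = domain.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in MBFC_RATINGS:
            return MBFC_RATINGS[candidate]
    return None

print(annotate_link("https://www.rt.com/news/some-story/"))
# → {'bias': 'center right', 'factualness': 'very low factual'}
```

Links whose domain does not appear in the rating list (shopping sites, links back to reddit, etc.) simply return `None` and are excluded from the annotated set.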
Validation of MBFC Annotations.
The use of fact checking sources such as MBFC is common practice for large scale studies, and MBFC in particular is widely used (Dinkov et al. 2019; Stefanov et al. 2020; Heydari et al. 2019; Main 2018; Starbird 2017; Darwish, Magdy, and Zanouda 2017; Nelimarkka, Laaksonen, and Semaan 2018). Additional confidence in MBFC annotations comes from the results of Bozarth, Saraf, and Budak (2020), who find that (1) MBFC offers the largest set of biased and low factual news sources when compared among 5 fact checking datasets, and (2) the selection of a specific fact checking source has little impact on the evaluation of online content. Furthermore, we find that the coverage (the percentage of links that can be annotated using MBFC, excluding links to obvious non-news sources such as links to elsewhere on reddit, to shopping sites, etc.) is very consistent over time (Fig. 1). While the labels in the Volkova et al. dataset are not directly comparable to MBFC, we can create a comparable class by comparing links with a MBFC factualness rating of 'mostly factual' or higher with Volkova et al.'s 'verified' news sources. In this case, when considering links that can be annotated using both datasets, MBFC and Volkova et al. have a Cohen's kappa coefficient of 0.82, indicating "almost perfect" inter-rater reliability (Landis and Koch 1977). We examined if these differences could have an impact on downstream analysis and found this to be unlikely. For example, results computed separately using MBFC and Volkova et al. agree with one another with a Pearson's correlation of 0.98 on the task of identifying the number of 'mostly factual' or higher links posted to a community.

Figure 2: Distributions of mean bias and factualness are quite similar for both the user and community units of analysis. Grey bars show the normalized total counts of links of each type across all of reddit.
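The Cohen's kappa agreement measure used above can be sketched as follows. The binary label sequences are hypothetical stand-ins for the real per-link annotations (1 = 'mostly factual' or higher under MBFC, or 'verified' under Volkova et al.):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(ca[k] / n * cb[k] / n for k in set(a) | set(b))
    return (observed - expected) / (1 - expected)

# Hypothetical per-link labels for links rated by both datasets.
mbfc    = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
volkova = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
print(round(cohen_kappa(mbfc, volkova), 2))  # → 0.78
```

Kappa corrects raw agreement for agreement expected by chance, which is why it is preferred over simple percent agreement for this comparison.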
Computing Mean Bias/Factualness.
As described above, MBFC labels for bias and factualness are ordinal, yet for many analyses, it is useful to have numeric labels (e.g. computing the variance of links in a community). To convert from MBFC's categorical labels to a numeric scale, we use a mapping of (-3, -2, -1, 0, 1, 2, 3) to assign 'extreme left' links a numeric bias value of -3, 'left' links a value of -2, 'center left' links a value of -1, 'center' links a value of 0, and positive values to map to the equivalent categories on the right. While this choice is somewhat arbitrary, it is consistent with the linear spacing between bias levels given by MBFC. Furthermore, we explored different mappings, including nonlinear ones, and found that our results are robust to different mappings. As such, we use the mapping given above as it is easiest to interpret. We use a similar mapping of (0, 1, 2, 3, 4, 5) to assign 'very low factual' links a numeric value of 0, 'low factual' links a value of 1, etc., with 'very high factual' links assigned a value of 5.

These numeric values are used to compute users' and communities' mean bias and mean factualness, central constructs in our analyses. To do so, we simply take the average of the numeric bias and factualness values of the links by each user or in each community. For many of our analyses, we group users by rounding their mean bias/factualness to the nearest integer. Thus, when we describe a user as having a 'center left bias,' we are indicating that the mean bias of the links they submitted is between -1.5 and -0.5.

The distributions of means are very similar for users and communities, with both closely following the overall distribution of news links on reddit, shown with grey bars (Fig. 2). 74% of communities and 73% of users have a mean bias of approximately center left, and 65% of communities and 62% of users have a mean factualness of 'high factual' (among users/communities with more than 10 links).

Similarly, we define user variance of bias as the variance of the bias values of the links submitted by a user, and community variance of bias as the variance of the bias values of links submitted to a community. As with mean bias, we find that the distributions of user and community variance of bias are very similar to one another. The median user has a variance of 0.85, approximately the variance of a user with center bias who submits 62% center links, 22% center-left or center-right links, and 16% left or right links. The median community has a variance of 0.91, approximately that of a community where 62% of the content submitted has center bias, 20% of the content has center-left or center-right bias, and 18% of the content has left or right bias. Of course, a substantial amount of a community's variance comes from the variance of its userbase. We explore sources of this variance in §4.
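The label-to-number mapping and the per-user mean and variance described above can be sketched as follows. The link list is hypothetical, chosen to roughly match the median-user profile in the text (62% center, 22% center-left/right, 16% left/right):

```python
from statistics import mean, pvariance

# Numeric encoding from the paper: bias on a (-3..3) scale.
BIAS = {"extreme left": -3, "left": -2, "center left": -1, "center": 0,
        "center right": 1, "right": 2, "extreme right": 3}

def bias_stats(link_labels):
    """Mean and (population) variance of the numeric bias of a set of links."""
    values = [BIAS[label] for label in link_labels]
    return mean(values), pvariance(values)

# Hypothetical 'center' user with 50 labeled links.
links = (["center"] * 31 + ["center left"] * 6 + ["center right"] * 5
         + ["left"] * 4 + ["right"] * 4)
m, var = bias_stats(links)
print(round(m, 2), round(var, 2))  # → -0.02 0.86
```

The resulting variance of about 0.86 is close to the 0.85 median user variance reported above; the factualness mapping (0 through 5) works identically.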
Links on reddit do not have equal impact; some links are viewed by far more people than others. To understand the impact of certain types of content, we would like to understand how broadly that content is viewed. As view counts are not publicly available, we use the number of subscribers to the community that a link was posted to as an estimate for the number of potential exposures to community members that this content may have had. While some users view content from communities they are not subscribed to, subscription counts capture both active contributors and passive consumers within the community, which motivated our use of this proxy over other alternatives.

As communities are constantly growing, we define the number of potential exposures to a link as the number of subscribers to the community the link was posted to at the time it was posted. To estimate historic subscriber counts, we make use of archived Wayback Machine snapshots of subreddit about pages, which provide the number of subscribers at the time of the snapshot. Where snapshots are unavailable, we overestimate the potential exposures by using the first percentile value (4 subscribers) from our subscriber count data. The effect of this imputation on our results is very minor as these only occur in communities with extremely little activity.

In this section, we examine the factors that contribute to a community's variance of bias. This variance can come from a combination of two sources: (1) community members who are individually ideologically diverse (user diversity), and (2) a diverse group of users with different mean biases (group diversity). High user diversity corresponds to a community whose members have high user variance (e.g. users who are ideologically diverse individually), and high group diversity corresponds to a community with high variance of its members' mean bias (e.g. a diverse group of users who may be ideologically consistent individually).
Of course, these sources of variance are not mutually exclusive; overall community variance is maximized when both user diversity and group diversity are large.
Method.
This intuition can be formalized using the Law of Total Variance, which states that total community variance is exactly the sum of User Diversity (within-user variance) and Group Diversity (between-user variance):
Var(B_c) = E[Var(B_c | U)] + Var(E[B_c | U])

where B_c is a random variable representing the bias of a link submitted to community c, and U is a random variable representing the user who submitted the link.

We compute user diversity and group diversity for each community. User diversity is given by taking the mean of each user's variance of bias, weighted by the number of labeled bias links that user submitted. Group diversity is given by taking the variance of each community member's mean user bias, again weighted by their number of labeled links. We then sum the user and group diversity values to compute the overall community variance of political bias.

To understand how communities vary relative to their mean, we compute the balance of links in the adjacent relatively more- and less-biased categories. For example, a community with 'left' mean bias has two adjacent categories: 'extreme left' and 'center left,' with 'extreme left' being the relatively-more biased category, and 'center left' being the relatively-less biased category.

Results.
Across all of reddit, we find that for most (82%) communities, group diversity constitutes a majority of their overall variance of bias. When binned by their mean bias, we find that communities with extreme bias have, on average, lower total variance than communities closer to the middle of the spectrum (Fig. 3). A community with mean bias of 'extreme left' would be expected to have a lower total variance as there are no links with bias further left than 'extreme left.' To control for this dynamic, we only compare symmetric labels: 'extreme left' to 'extreme right,' 'left' to 'right,' and 'center left' to 'center right.'

We find that right- and left-leaning communities have similar group diversity (Fig. 3, right), but right-leaning communities have 341% more user diversity than equivalently left-leaning communities, on average (Fig. 3, left). As a result, the average overall variance is 105% greater for right-leaning communities than left-leaning communities. Interestingly, we find that a larger share of right-leaning communities' variance is in more biased categories, relative to the community mean. 74% of right-leaning communities' adjacent links are relatively-more biased, compared to 55% for left-leaning communities; in other words, an increase of 35% (74%/55% ≈ 1.35).

Figure 3: While group diversity is similar between left- and right-leaning communities with a similar degree of bias (right panel), right-leaning communities have higher user diversity than equivalently biased communities on the left (left panel). As a result, right-leaning communities have higher overall variance around their community mean. Right-leaning communities also favor relatively-more biased links, when compared to left-leaning communities.

Implications.
These results suggest that members of communities on the left and right have comparable group diversity, indicating that the range of users is equally similar to one another. However, right-leaning communities have higher user diversity, indicating that the individual users themselves tend to submit links to news sources with a larger variety of political leaning. This creates higher overall variance of political bias in right-leaning communities; however, these right-leaning communities also contain more links with higher bias, relative to the community mean, as opposed to more relatively-neutral news sources.
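The Law of Total Variance decomposition used in this section can be checked numerically. The sketch below uses a hypothetical community of three users and weights each user by their share of labeled links, as in the Method; the within-user and between-user components sum exactly to the community's total variance:

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical community: each entry is (user, numeric bias of one link).
links = [("a", -1), ("a", 0), ("a", -1), ("b", -2), ("b", -1), ("c", 0), ("c", 1)]

by_user = defaultdict(list)
for user, bias in links:
    by_user[user].append(bias)

n = len(links)
weights = {u: len(v) / n for u, v in by_user.items()}

# User diversity: weighted mean of each user's (population) variance of bias.
user_diversity = sum(w * pvariance(by_user[u]) for u, w in weights.items())

# Group diversity: weighted variance of users' mean bias around the community mean.
community_mean = mean(b for _, b in links)
group_diversity = sum(w * (mean(by_user[u]) - community_mean) ** 2
                      for u, w in weights.items())

total = pvariance([b for _, b in links])
print(abs((user_diversity + group_diversity) - total) < 1e-9)  # → True
```

With link-count weights and population variances, the identity holds exactly, so the two diversity components can be reported separately without losing any of the community's total variance.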
The impact of content on reddit is affected by users' behavior: how long they stay on the platform, how they vote, and how they amplify. In this section, we examine user longevity and turnover, community acceptance of biased and low factual content, and amplification through crossposting.
Do users who post extremely biased or low factual content stay on reddit as long as other users?
Method.
We compute each user’s lifespan on the platformby measuring how long they stay active on the platform aftertheir first submission. We define “active” as posting at leastonce every 30 days, as in Waller and Anderson (2019). We xtreme Left Center Extreme Right
Bias E x p e c t e d L i f e s p a n ( d a y s ) Very Low Low Mixed Mostly High Very High
Factualness E x p e c t e d L i f e s p a n ( d a y s ) Figure 4: Users with extreme mean bias stay on reddit lessthan half as long as users with center mean bias. Users withlow and very low mean factualness also leave more quickly,but expected lifespan decreases as users’ mean factualnessincreases past ‘mixed factual’. Across all figures, error barscorrespond to bootstrapped 95% confidence intervals (andmay be too small to be visible).group users by their mean bias and factualness, and for eachgroup, compute the expected lifespan of the group members.
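Under this definition of activity, a single user's lifespan can be sketched as follows. The exact end-of-activity convention (ending the streak at the last submission before a gap longer than 30 days) is our assumption; the paper's implementation may differ in detail:

```python
from datetime import datetime, timedelta

ACTIVE_GAP = timedelta(days=30)  # "active" = posting at least once every 30 days

def lifespan_days(submission_times):
    """Days from a user's first submission until they stop being 'active'.

    The active streak ends at the last submission before a gap longer
    than 30 days (or at the user's final submission).
    """
    times = sorted(submission_times)
    end = times[0]
    for t in times[1:]:
        if t - end > ACTIVE_GAP:
            break  # gap of more than 30 days: the user churned here
        end = t
    return (end - times[0]).days

# Hypothetical user: steady posting, a long gap, then a brief return.
posts = [datetime(2018, 1, 1), datetime(2018, 1, 20), datetime(2018, 2, 10),
         datetime(2018, 6, 1)]  # the 111-day gap ends the active streak
print(lifespan_days(posts))  # → 40
```

Grouping users by their rounded mean bias or factualness and averaging these lifespans yields the per-group expected lifespans plotted in Fig. 4.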
Results.
We find that expected lifespan is longer for users who typically submit less politically biased content, with users whose mean bias is near center remaining on reddit for approximately twice as long as users with extreme or moderate mean bias, on average (Fig. 4, top). This result holds regardless of whether users are left- or right-leaning. Users with a mean factualness close to 'mixed factual' or lower leave reddit 68% faster than users whose mean factualness is near 'mostly factual' (Fig. 4, bottom). However, we also find that users' expected lifespan decreases dramatically as their mean factualness increases to 'high' or 'very high' levels of factualness.
Implications.
These results suggest that users who mostly post links to extremely biased or low factual news sources leave reddit more quickly than other users. We can only speculate as to the causes of this faster turnover, but we note that users who stay on reddit the longest tend to post links to the types of news sources that are most prevalent (grey bars in Fig. 2 show the overall prevalence of each type of link).

The faster turnover suggests that users sharing this type of content leave relatively early, limiting their impact on their communities. However, faster turnover also may make user-level interventions such as bans less effective, as these sanctions have shorter-lived impact when the users they are made against leave the site more quickly. Future research could examine why users leave, whether they rejoin with new accounts in violation of reddit policy, and the efficacy of restrictions on new accounts.
How do communities respond to politically biased or low factual content?
Method.
On reddit, community members curate content in their communities by voting submissions up or down, which affects its position on the community feed (Glenski, Pennycuff, and Weninger 2017). A submission's 'score' is defined by reddit as approximately the number of upvotes minus the number of downvotes that post receives. The score has been used in previous work as a proxy for a link's reception by a community (Waller and Anderson 2019; Datta and Adar 2019). Links submitted to larger communities are seen by more users and therefore receive more votes. Therefore, we normalize each link's score by dividing by the mean score of all submissions in that community; links with a normalized score over 1 are more accepted than average, and links with a score under 1 are less accepted than average. In accordance with reddit's ranking algorithm, submissions with higher normalized score appear higher in the feed viewed by community members, and stay in this position for longer (Medvedev, Lambiotte, and Delvenne 2018).

To compute the community acceptance of links of a given bias or factualness, we average the normalized score of all links of that type in that community. We then take the median community acceptance across all left-leaning, right-leaning, and neutral communities. Here we use the median as it is more resilient to outliers than the mean.

Results.
We find that, regardless of the community's political leaning, median expected community acceptance is 18% lower for extremely biased content than other content (Fig. 5). For left-leaning and neutral communities, community acceptance decreases monotonically as factualness drops below 'high.' However, we observe that right-leaning communities are 167% more accepting of extreme right biased content and 85% more accepting of very low factual content than left-leaning and neutral communities (both differences are significant under Mann–Whitney U tests).

Implications.
This suggests that across reddit, communitiesare sensitive to extremely biased and low factual content,and users’ voting behavior is fairly effective at reducing theacceptance of this content. However, curation does not seemto result in better-than-average acceptance for any content—no median acceptance values are significantly ( p < . )above 1, as non-news content tends to receive higher com-munity acceptance than news content.Previous research has found that on Twitter, news thatfailed fact-checking spread more quickly and was seen morewidely than news that passed a fact-check (Vosoughi, Roy,and Aral 2018). Interestingly, we find evidence that behav-ior on reddit is somewhat different, with median left-leaning,right-leaning, and neutral communities all being less accept-ing of low and very low factual content. Whereas Twitterusers are only able to retweet or like content, on reddit, usersmay downvote to decrease visibility of content they objectto. We speculate that this may partially explain the differ-ences in acceptance that we find between reddit and Twitter. xtreme Left Center Extreme Right Bias of Linked-to News Source M e d i a n C o mm un i t y A cc e p t a n c e Very Low Low Mixed Mostly High Very High
Factualness of Linked-to News Source M e d i a n C o mm un i t y A cc e p t a n c e Left-leaning communitiesNeutral communitiesRight-leaning communities
Figure 5: Regardless of the political leaning of the community, extremely biased content is less accepted by communities than content closer to center. Similarly, low and very low factual content is less accepted than higher factual content. Points are perturbed on the x-axis to aid readability.
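The acceptance measure described in the Method above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the paper's pipeline: the link records, community names, and field layout are invented, and the real analysis runs over hundreds of millions of submissions.

```python
# Sketch of the community-acceptance computation: normalize each link's
# score by its community's mean score, average per (community, bias)
# category, then take the median across communities.
from collections import defaultdict
from statistics import mean, median

links = [
    # (community, bias label of linked news source, submission score)
    ("r/a", "center", 10), ("r/a", "left", 30), ("r/a", "center", 20),
    ("r/b", "center", 5),  ("r/b", "left", 15),
]

# 1. Mean score per community, used to normalize; a normalized score
#    above 1 means better-than-average acceptance in that community.
scores_by_comm = defaultdict(list)
for comm, _, score in links:
    scores_by_comm[comm].append(score)
mean_by_comm = {c: mean(s) for c, s in scores_by_comm.items()}

# 2. Average the normalized scores per (community, bias) pair.
per_pair = defaultdict(list)
for comm, bias, score in links:
    per_pair[(comm, bias)].append(score / mean_by_comm[comm])

# 3. Median across communities per bias category (the median is more
#    resilient to outliers than the mean).
per_bias = defaultdict(list)
for (comm, bias), norm_scores in per_pair.items():
    per_bias[bias].append(mean(norm_scores))
acceptance = {bias: median(vals) for bias, vals in per_bias.items()}
print(acceptance)
```

With these toy records, 'left' links score above their communities' averages (acceptance 1.5) while 'center' links score below (0.625), mirroring how Fig. 5 compares categories against the community baseline of 1.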
How does amplification of content affect exposure to biased and low factual content? On reddit, users are not only able to submit links to external content (such as news sites), but are also able to submit links to internal content elsewhere on reddit, effectively re-sharing and therefore amplifying content by increasing its visibility on the site. This is commonly known as 'crossposting,' and often occurs when a user submits a post from one subreddit to another subreddit, although such re-sharing of internal content can happen within a single community as well. Here, we seek to understand the role that amplification through crossposts plays in reddit users' exposure to various kinds of content.
Method.
To identify the political bias and factualness of crossposted content, we identify all crossposted links to news sources and propagate the label of the crossposted link. Then, we compute the fraction of total potential exposures from crossposts for each bias/factualness category.
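The Method above can be sketched as follows. This is a hedged illustration: the records, the use of subscriber counts as a proxy for potential exposures, and the field names are all hypothetical, not the paper's actual data model.

```python
# Sketch: propagate each news source's bias label to its crossposts,
# then compute the fraction of potential exposures in each category
# that comes from crossposts rather than direct links.
from collections import defaultdict

links = [
    # (bias label, subscribers of community posted to, is_crosspost)
    ("center",  1000, False),
    ("center",   100, True),   # crosspost inherits the original link's label
    ("extreme",  500, False),
    ("extreme",   10, True),
]

exposures = defaultdict(float)           # total potential exposures per label
crosspost_exposures = defaultdict(float) # ...of which from crossposts
for bias, subscribers, is_crosspost in links:
    exposures[bias] += subscribers
    if is_crosspost:
        crosspost_exposures[bias] += subscribers

# Fraction of potential exposures attributable to amplification.
frac = {bias: crosspost_exposures[bias] / exposures[bias] for bias in exposures}
print(frac)
```

In this toy data, crossposts reach smaller communities than the original links, so their share of exposures (about 9% for 'center', about 2% for 'extreme') is much smaller than their share of links, which is the same qualitative pattern the Results below report at scale.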
Results.
We find that amplification via crossposting has an overall small effect on the potential exposures of news content. While 10% of all news links are crossposts, only 1% of potential exposures to news links are due to crossposts. This suggests that the majority of crossposts are content posted in relatively larger communities re-shared to relatively smaller communities with relatively fewer subscribers, diminishing the impact of amplification via crossposting. As such, direct links to news sites have a far greater bearing on reddit users' exposure to news content than crossposts.

However, the role of crossposts in exposing users to news content is still important, as crossposts account for more than 750 billion potential exposures. We find that extremely biased and low factual content is amplified less than other content, as shown in Fig. 6, which illustrates the percentage of total potential exposures that come from crossposts for each bias/factualness category. reddit users exposed to center left biased, center biased, or center right biased content are 53% more likely to be exposed to this content via amplification than reddit users exposed to extremely biased content. Similarly, reddit users exposed to 'mostly factual' or higher factualness content are 217% more likely to be exposed to such content via amplification than reddit users exposed to very low factual content.
Implications.
Given that only 1% of potential exposures are from amplifications, understanding the way that direct links to external content are shared is critical to understanding the sharing of news content on reddit more broadly.

The relatively lower amplification of extremely biased and very low factual content suggests users' sensitivity to the bias and factualness of the content they are re-sharing. As in §5.2, this suggests differences between reddit and Twitter, where content that failed a fact-check has been found to spread more quickly than fact-checked content (Vosoughi, Roy, and Aral 2018). We speculate that this may be due to structural differences between the two platforms. On reddit, users primarily consume content through subscriptions to communities, not other users. This may explain the diminished impact of re-sharing on reddit compared to Twitter.
It is critical to understand where different news content is concentrated in order to best inform strategies for monitoring and managing its spread online. In this section, we examine how extremely biased and low factual content is distributed across users, communities, and news sources. We also compare the concentration of extremely biased and low factual content to all content.
Method.
We consider three types of content: (1) news content with extreme bias or low factualness, (2) all news content, and (3) all content (including non-news). We group each of these types of content by three perspectives: the user who posted the content, the community it was posted to, and the news source (or domain, in the case of all content) linked to. We then take the cumulative sum of potential exposures across the users, communities, and news sources to compute the fraction of potential exposures contributed by the top n% of users, communities, and news sources. We repeat this process, replacing the number of potential exposures with the total number of links, to consider the concentration of links being submitted, regardless of visibility.

Figure 6: Extremely biased and low factual content is amplified by crossposts relatively less than other content. Regardless of the bias or factualness of the content, while crossposts are responsible for more than 750 billion potential exposures, they make up only 1% of total potential exposures, suggesting that direct links to news sources play an especially important role in content distribution.
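The cumulative-share computation described in the Method above, together with the Gini coefficient reported in the Results, can be sketched as follows. The helper names and the data are ours, invented for illustration; they are not the paper's code or numbers.

```python
# Sketch of the concentration analysis: what share of potential exposures
# comes from the top n% of communities, and how unequal is the
# distribution overall (Gini coefficient)?
def top_share(values, top_frac):
    """Fraction of the total contributed by the top `top_frac` of entries."""
    vals = sorted(values, reverse=True)
    k = max(1, round(len(vals) * top_frac))
    return sum(vals[:k]) / sum(vals)

def gini(values):
    """Gini coefficient of a list of non-negative values (0 = equal, 1 = one entry holds all)."""
    vals = sorted(values)
    n = len(vals)
    total = sum(vals)
    # Standard closed form over the sorted values.
    return sum((2 * i - n - 1) * v for i, v in enumerate(vals, 1)) / (n * total)

# Toy data: ten communities, one of which dominates potential exposures.
exposures = [1, 1, 1, 1, 1, 1, 1, 1, 2, 90]
print(top_share(exposures, 0.10))   # share of exposures from the top 10%
print(gini(exposures))
```

Here the top 10% of communities (one community) accounts for 90% of exposures and the Gini coefficient is about 0.81; the paper's finding of 99% of exposures in 0.5% of communities (Gini = 0.997) is an even more extreme version of the same shape.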
Results.
We find that overall, extremely biased and low factual content is highly concentrated across all three perspectives, but is especially concentrated in a small number of communities, where 99% of potential exposures stem from a mere 109 (0.5%) communities (Gini coefficient = 0.997) (Fig. 7a). No matter the perspective, exposures to extremely biased or low factual content (solid line) are less concentrated than all content (dotted line) (Fig. 7abc).

Under the community and news source perspectives, exposures (Fig. 7ac) are more concentrated than links (Fig. 7df). While links are already concentrated in a small share of communities, some communities are especially large, and therefore content from these communities receives a disproportionate share of potential exposures. This is not the case for users, as the distributions of exposures (Fig. 7b) are less concentrated than the distributions of links (Fig. 7e). This indicates that while some users submit a disproportionate share of links, these are not the users whose links receive the largest potential exposure, as potential exposure is primarily a function of submitting links to large communities.
Implications.
The extreme concentration of extremely biased or low factual content amongst a tiny fraction of communities supports reddit's recent and high-profile decision to take sanctions against entire communities, not just specific users (Isaac and Conger 2021). These decisions have been extensively studied (Chandrasekharan et al. 2017, 2020; Thomas et al. 2021; Saleem and Ruths 2018; Ribeiro et al. 2020). While this content is relatively less concentrated amongst users, in absolute terms, this content is still fairly concentrated, with 10% of users contributing 84% of potential exposures. As such, moderation sanctions against users can still be effective (Matias 2019b).

Figure 7: When compared to all content on reddit (dotted line), extremely biased or low factual content (solid line) is more broadly distributed, making it harder to detect, regardless of the community, user, or news source perspective. However, 99% of potential exposures to extremely biased or low factual content are restricted to only 0.5% of communities. Here, a curve closer to the lower-right corner indicates a more extreme concentration. Note that axis limits do not extend from 0 to 100%.
Summary & Implications.
In this work, we analyze all 580 million submissions to reddit from 2015-2019, and annotate 35 million links to news sources with their political bias and factualness using Media Bias/Fact Check. We find:

• Right-leaning communities' links to news sources have 105% greater variance in their political bias than left-leaning communities. When right-leaning communities link to news sources that are different than the community average, they link to relatively-more biased sources 35% more often than left-leaning communities (§4).

• Existing curation and amplification behaviors moderately reduce the impact of highly biased and low factual content. This suggests that reddit is somewhat less prone to the distribution of low factual content than Twitter, perhaps due to its explicit community structure, or the ability for users to downvote content (§5).

• Highly biased and low factual content tends to be shared by a broader set of users and in a broader set of communities than news content as a whole. Furthermore, the distribution of this content is more concentrated in a small number of communities than in a small number of users, as 99% of exposures to extremely biased or low factual content stem from only 0.5% (109) of communities (§6). This lends credence to recent reddit interventions at the community level, including bans and quarantines.

Limitations.

One limitation of our analyses is the use of a single news source rating service, MBFC. However, the selection of news source rating annotation sets has been found to have a minimal impact on research results (Bozarth, Saraf, and Budak 2020). MBFC is the largest dataset (that we know of) of news sources' bias and factualness, and is widely used (Dinkov et al. 2019; Stefanov et al. 2020; Heydari et al. 2019; Starbird 2017; Darwish, Magdy, and Zanouda 2017). More robust approaches could combine annotations from multiple sources; we found that MBFC annotations agree with the Volkova et al. (2017) dataset with a Pearson correlation of 0.96 on an example downstream task (§3.2).

Our focus is on the bias and factualness of news sources shared online. We do not consider factors such as the content of links (e.g. shared images, specific details of news stories), or the context in which links are shared (e.g. sentiment of a submission's comments). These factors are important areas for future work, and are outside the scope of this paper.

While MBFC (and by extension, our annotations) includes news sources from around the world, our analyses, especially the left-right political spectrum and associated colors, take a US-centric approach. Polarization and misinformation are challenges across the globe (Newman 2020), and more work is needed on other cultural contexts.

Our paper explores the impact of curation and amplification practices, but not the impact of community moderators, who are a critical component of reddit's moderation pipeline (Matias 2019b). Future work could examine news content removed by moderators.
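As a hedged illustration of the annotation-agreement check mentioned above, Pearson correlation between two numeric rating sets for the same sources can be computed as follows. The rating values here are invented for illustration and are not taken from MBFC or the Volkova et al. (2017) dataset.

```python
# Sketch: Pearson correlation between two hypothetical sets of numeric
# bias ratings (e.g. -2 = extreme left ... +2 = extreme right) for the
# same five news sources.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

mbfc_ratings  = [-2, -1, 0, 1, 2]   # hypothetical ratings from one service
other_ratings = [-2, -1, 0, 2, 2]   # hypothetical ratings from another
print(pearson(mbfc_ratings, other_ratings))
```

Two rating sets that mostly agree, as in this toy example, yield a correlation near 1; a value like the paper's 0.96 indicates that the two annotation sources rank news outlets very similarly.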
Biased and inaccurate news shared online are significant problems, with real harms across our society. Large-scale studies of news sharing online are critical for understanding the scale and dynamics of these problems. We presented the largest study to date of news sharing behavior on reddit, and found that right-leaning communities have more politically varied and relatively-more biased links than left-leaning communities, that current voting and re-sharing behaviors are moderately effective at reducing the impact of extremely biased and low factual content, and that such content is extremely concentrated in a small number of communities. We make our dataset of news sharing on reddit public (https://behavioral-data.github.io/news_labeling_reddit/).

Acknowledgements
This research was supported in part by the Office for Naval Research, the Pacific Northwest National Laboratory, NSF grant IIS-1901386, the Bill & Melinda Gates Foundation (INV-004841), and a Microsoft AI for Accessibility grant.
References
Allen, J.; Howland, B.; Mobius, M.; Rothschild, D.; and Watts, D. J. 2020. Evaluating the fake news problem at the scale of the information ecosystem. Science Advances.
Allison, K.; and Bussey, K. 2020. Communal Quirks and Circlejerks: A Taxonomy of Processes Contributing to Insularity in Online Communities. In ICWSM.
Baly, R.; Da San Martino, G.; Glass, J.; and Nakov, P. 2020. We Can Detect Your Bias: Predicting the Political Ideology of News Articles. In EMNLP.
Bozarth, L.; Saraf, A.; and Budak, C. 2020. Higher Ground? How Groundtruth Labeling Impacts Our Understanding of Fake News about the 2016 U.S. Presidential Nominees. In ICWSM.
Chandrasekharan, E.; Gandhi, C.; Mustelier, M. W.; and Gilbert, E. 2019. Crossmod: A Cross-Community Learning-based System to Assist Reddit Moderators. CHI 3: 1–30.
Chandrasekharan, E.; Jhaver, S.; Bruckman, A.; and Gilbert, E. 2020. Quarantined! Examining the Effects of a Community-Wide Moderation Intervention on Reddit. ArXiv abs/2009.11483.
Chandrasekharan, E.; Pavalanathan, U.; Srinivasan, A.; Glynn, A.; Eisenstein, J.; and Gilbert, E. 2017. You Can't Stay Here: The Efficacy of Reddit's 2015 Ban Examined Through Hate Speech. CHI 1: 31:1–31:22.
Chandrasekharan, E.; Samory, M.; Jhaver, S.; Charvat, H.; Bruckman, A.; Lampe, C.; Eisenstein, J.; and Gilbert, E. 2018. The Internet's Hidden Rules. CHI 2: 1–25.
Choi, D.; Chun, S.; Oh, H.; Han, J.; et al. 2020. Rumor propagation is amplified by echo chambers in social media. Scientific Reports.
Datta, S.; and Adar, E. 2019. Extracting Inter-community Conflicts in Reddit. In ICWSM.
Demszky, D.; Garg, N.; Voigt, R.; Zou, J.; Gentzkow, M.; Shapiro, J.; and Jurafsky, D. 2019. Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings. In NAACL-HLT.
Dinkov, Y.; Ali, A.; Koychev, I.; and Nakov, P. 2019. Predicting the Leading Political Ideology of YouTube Channels Using Acoustic, Textual, and Metadata Information. In INTERSPEECH.
Dosono, B.; and Semaan, B. C. 2019. Moderation Practices as Emotional Labor in Sustaining Online Communities: The Case of AAPI Identity Work on Reddit. CHI.
Ferrara, E. 2017. Contagion dynamics of extremist propaganda in social networks. Information Sciences.
Fiesler, C.; and Proferes, N. 2018. "Participant" Perceptions of Twitter Research Ethics. Social Media + Society.
Isaac, M.; and Conger, K. 2021. Reddit bans forum dedicated to supporting Trump. The New York Times.
Jhaver, S.; Bruckman, A.; and Gilbert, E. 2019. Does Transparency in Moderation Really Matter? CHI 3: 1–27.
Jiang, S.; Robertson, R. E.; and Wilson, C. 2019. Bias Misperceived: The Role of Partisanship and Misinformation in YouTube Comment Moderation. In ICWSM.
Jiang, S.; Robertson, R. E.; and Wilson, C. 2020. Reasoning about Political Bias in Content Moderation. In AAAI.
Kharratzadeh, M.; and Üstebay, D. 2017. US Presidential Election: What Engaged People on Facebook. In ICWSM.
Kouzy, R.; Jaoude, J. A.; Kraitem, A.; Alam, M. B. E.; Karam, B.; Adib, E.; Zarka, J.; Traboulsi, C.; Akl, E. A.; and Baddour, K. 2020. Coronavirus Goes Viral: Quantifying the COVID-19 Misinformation Epidemic on Twitter. Cureus.
Linvill, D. L.; and Warren, P. L. 2020. Troll factories: Manufacturing specialized disinformation on Twitter. Political Communication.
Marwick, A.; and Lewis, R. 2017. Media manipulation and disinformation online. Data & Society. https://datasociety.net/library/media-manipulation-and-disinfo-online/.
Matias, J. N. 2019a. Preventing harassment and increasing group participation through social norms in 2,190 online science discussions. PNAS.
Matias, J. N. 2019b. The Civic Labor of Volunteer Moderators Online. Social Media + Society.
Medvedev, A. N.; Lambiotte, R.; and Delvenne, J.-C. 2018. The Anatomy of Reddit: An Overview of Academic Research. ArXiv abs/1810.10881.
Mitchell, A.; Gottfried, J.; Stocking, G.; Walker, M.; and Fedeli, S. 2019. Many Americans Say Made-Up News Is a Critical Problem That Needs To Be Fixed. Pew Research Center Science and Journalism.
Newman, N. 2020. Digital News Report. Reuters Institute.
Qazvinian, V.; Rosengren, E.; Radev, D.; and Mei, Q. 2011. Rumor has it: Identifying misinformation in microblogs. In EMNLP, 1589–1599.
Rajadesingan, A.; Resnick, P.; and Budak, C. 2020. Quick, Community-Specific Learning: How Distinctive Toxicity Norms Are Maintained in Political Subreddits. In ICWSM.
Recuero, R.; Soares, F. B.; and Gruzd, A. A. 2020. Hyperpartisanship, Disinformation and Political Conversations on Twitter: The Brazilian Presidential Election of 2018. In ICWSM.
Ribeiro, M. H.; Jhaver, S.; Zannettou, S.; Blackburn, J.; Cristofaro, E. D.; Stringhini, G.; and West, R. 2020. Does Platform Migration Compromise Content Moderation? Evidence from r/The_Donald and r/Incels. ArXiv abs/2010.10397.
Risch, J.; and Krestel, R. 2020. Top Comment or Flop Comment? Predicting and Explaining User Engagement in Online News Discussions. In ICWSM.
Saleem, H. M.; and Ruths, D. 2018. The Aftermath of Disbanding an Online Hateful Community. ArXiv abs/1804.07354.
Samory, M.; Abnousi, V. K.; and Mitra, T. 2020a. Characterizing the Social Media News Sphere through User Co-Sharing Practices. In ICWSM, volume 14, 602–613.
Samory, M.; Abnousi, V. K.; and Mitra, T. 2020b. Characterizing the Social Media News Sphere through User Co-Sharing Practices. In ICWSM.
Stefanov, P.; Darwish, K.; Atanasov, A.; and Nakov, P. 2020. Predicting the Topical Stance and Political Leaning of Media using Tweets. In ACL.
Thomas, P. B.; Riehm, D.; Glenski, M.; and Weninger, T. 2021. Behavior Change in Response to Subreddit Bans and External Events.
Volkova, S.; Shaffer, K.; Jang, J. Y.; and Hodas, N. 2017. Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter. In ACL.
Vosoughi, S.; Roy, D.; and Aral, S. 2018. The Spread of True and False News Online. Science.
Waller, I.; and Anderson, A. 2019. Generalists and Specialists: Using Community Embeddings to Quantify Activity Diversity in Online Platforms. In The World Wide Web Conference, WWW '19, 1954–1964. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3308558.3313729.
Zhang, A. X.; Hugh, G.; and Bernstein, M. S. 2020. PolicyKit: Building Governance in Online Communities.