Exploring Lightweight Interventions at Posting Time to Reduce the Sharing of Misinformation on Social Media
Farnaz Jahanbakhsh, Amy X. Zhang, Adam J. Berinsky, Gordon Pennycook, David G. Rand, David R. Karger
FARNAZ JAHANBAKHSH, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA
AMY X. ZHANG, Allen School of Computer Science & Engineering, University of Washington, USA
ADAM J. BERINSKY, Political Science, Massachusetts Institute of Technology, USA
GORDON PENNYCOOK, Hill/Levene Schools of Business, University of Regina, Canada
DAVID G. RAND, Sloan School of Management/Brain and Cognitive Sciences, Massachusetts Institute of Technology, USA
DAVID R. KARGER, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA

When users on social media share content without considering its veracity, they may unwittingly be spreading misinformation. In this work, we investigate the design of lightweight interventions that nudge users to assess the accuracy of information as they share it. Such assessment may deter users from posting misinformation in the first place, and their assessments may also provide useful guidance to friends aiming to assess those posts themselves.

In support of lightweight assessment, we first develop a taxonomy of the reasons why people believe a news claim is or is not true; this taxonomy yields a checklist that can be used at posting time. We conduct evaluations to demonstrate that the checklist is an accurate and comprehensive encapsulation of people's free-response rationales.

In a second experiment, we study the effects of three behavioral nudges—1) checkboxes indicating whether headlines are accurate, 2) tagging reasons (from our taxonomy) that a post is accurate via a checklist, and 3) providing free-text rationales for why a headline is or is not accurate—on people's intention of sharing the headline on social media. From an experiment with 1668 participants, we find that both providing accuracy assessments and providing rationales reduce the sharing of false content. They also reduce the sharing of true content, but to a lesser degree, yielding an overall decrease in the fraction of shared content that is false.

Our findings have implications for designing social media and news sharing platforms that draw from richer signals of content credibility contributed by users. In addition, our validated taxonomy can be used by platforms and researchers as a way to gather rationales in an easier fashion than free-response.

CCS Concepts: •
Human-centered computing → Empirical studies in collaborative and social computing; Empirical studies in HCI.

Additional Key Words and Phrases: Misinformation, Social Media, Behavioral Nudges, Reasons Why People Believe News
Authors' addresses: Farnaz Jahanbakhsh, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, USA; Amy X. Zhang, Allen School of Computer Science & Engineering, University of Washington, Seattle, USA; Adam J. Berinsky, Political Science, Massachusetts Institute of Technology, Cambridge, USA; Gordon Pennycook, Hill/Levene Schools of Business, University of Regina, Regina, Canada; David G. Rand, Sloan School of Management/Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, USA; David R. Karger, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, USA.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2021 Copyright held by the owner/author(s).
2573-0142/2021/4-ART18
https://doi.org/10.1145/3449092

ACM Reference Format:
Farnaz Jahanbakhsh, Amy X. Zhang, Adam J. Berinsky, Gordon Pennycook, David G. Rand, and David R. Karger. 2021. Exploring Lightweight Interventions at Posting Time to Reduce the Sharing of Misinformation on Social Media.
Proc. ACM Hum.-Comput. Interact.
5, CSCW1, Article 18 (April 2021), 42 pages. https://doi.org/10.1145/3449092
Social media has lowered the barrier to publishing content. While this empowerment of the individual has led to positive outcomes, it has also encouraged the fabrication of misinformation by malicious actors and its circulation by unwitting platform users [24, 75]. Given widespread concerns about misinformation on social media [6, 8, 15, 18, 60], many researchers have investigated measures to counter misinformation on these platforms. Some of these initiatives include detecting false or misleading information using machine learning algorithms [12, 57, 67] and crowdsourcing [5, 9, 19, 33, 54], identifying bad actors and helping good actors differentiate themselves [82], and providing fact-checked information related to circulated news claims [23, 35, 50, 81]. Social media companies themselves have also enlisted more contract moderators as well as third-party fact-checkers to remove or down-rank misinformation after publication [4].

While credibility signals and fact-check flags provide useful filters and signals for readers, and algorithmic and human moderators help stop misinformation that is already spreading, none of these initiatives confront the underlying design of social media that leads to the proliferation of misinformation in the first place. Part of the reason misinformation spreads so well is that social media prioritizes engagement over accuracy. For instance, a user who encounters an incorrect post and then leaves a comment refuting it may in fact be inadvertently disseminating it farther, because the system counts the comment as engagement. A related driver is an emphasis on low barriers to sharing that allows users to share content without much attention to its accuracy or potential negative consequences, simply to receive social feedback or elevated engagement [7, 25].

Thus, in this work, we begin to explore how one might alter social media platforms to better surface accuracy and evidence, not just engagement, when spreading content. To achieve this, we consider how to raise some small barriers to sharing in order to promote accurate content, without drastically changing the lightweight sharing practices typical of social media. For example, sharing of any kind of content would likely plummet if platforms demanded that users receive substantial training in assessing information before being allowed to share, or that they perform extensive research on every post before sharing it. Instead, we look to interventions matched in scale to current sharing behavior, such as clicking a "like" button or emoticon, adding a hashtag, or writing a comment. We therefore explore the potential impact of (i) requiring sharers to click a button to indicate whether they think a story is accurate or not when sharing it, (ii) requiring sharers to choose at least one tag from a small checklist indicating why they consider it accurate, and (iii) writing a short comment explaining their accuracy assessment. Introducing these interventions at posting time would encourage users to reflect on veracity before posting and perhaps reconsider posting an inaccurate piece of content. These assessments could also provide valuable information to other users seeking to form their own assessments of a post's accuracy.

In this work, we describe two studies completed using participants from Mechanical Turk motivated by these ideas. The first aims to develop a small set of categories that (i) cover the majority of reasons people consider content to be accurate or inaccurate and (ii) can be used accurately by them when providing assessments.
This allows us to test, as one of our interventions, a checklist that can reasonably replace a free-text response. The second study considers the impact of our three proposed interventions—accuracy assessment, reasoning "categories" provided by the first study, and textual reasoning—on sharing behavior.
Prior work has shown that priming people to think about accuracy when they are considering sharing a post can help reduce their likelihood of sharing falsehoods [52, 53]. But little work has been done on how asking about accuracy could be integrated into platforms in a lightweight way. In the first study, we examine how to capture people's rationale for a claim's (in)accuracy in a structured format. To create categories of rationales, we ask the following research question:

• RQ1: What are the reasons why people believe or disbelieve news claims?
In this study, we developed a taxonomy of reasons and presented a set of claims along with the taxonomy to participants to choose from, as they provided their rationales for the (in)accuracy of the news claims. We iterated on the taxonomy until participants were able to use it reliably to label their rationales.

Armed with our taxonomy of reasons, the second research question that we address is:

• RQ2: How does asking people to provide accuracy assessments of news claims and their rationales affect their self-reported sharing intentions on social media?

Based on results from prior literature [52, 53], our hypothesis is that asking people about content accuracy lowers their intention of sharing false content, but also of sharing true content to some degree. An ideal intervention would significantly reduce sharing of false content without much impacting the sharing of true content. We further hypothesize that additionally asking people to provide their reasoning when evaluating the accuracy of content would help them even more with discernment than simply asking for an accuracy judgement.

The study we conducted involved presenting news claims spanning different domains and partisan orientations to a set of participants, asking one subset whether the claims were accurate and another subset for their reasons for believing so in addition to an accuracy assessment. For each news claim, participants indicated whether they would share it on social media. We find that asking about accuracy decreases the likelihood of sharing both true and false news, but of false news to a greater degree. We also find that additionally asking for a rationale via free text leads to a similar outcome, reducing sharing of both true and false headlines, but further decreasing the ratio of shared false headlines to true ones. We delve deeper into providing rationales by comparing a condition where reasoning is provided in free text to one where users must also select from a checklist of the reason categories taken from our taxonomy. We find that having people additionally work through the checklist of reasons does not further reduce the ratio of shared false headlines to true ones compared to free-text reasoning alone. However, such structured rationales can be beneficial for integration into platforms as signals of credibility.

Our findings on the effects of the accuracy and reasoning nudges on sharing decisions point to potential designs for social media platforms to introduce at posting time to reduce sharing of misinformation. In addition, our taxonomy of rationales can be used to form a checklist for users to report their reasons for believing or disbelieving a claim in an easier fashion. Because these inputs are structured annotations, they provide a potential added benefit in that they can be aggregated and presented to friends of the poster, for example indicating that "2 friends think this post is true and 27 think it is false, 2 based on firsthand knowledge." This aggregate assessment could help warn a user about misinformation and could also help guide users to friends who are misinformed and could benefit from a conversation about the post. We conclude by discussing these and other possible social media designs that could be introduced as a result of this work.
We situate our study in prior work related to how misinformation spreads and what factors affect people's sharing behaviors, measures to combat misinformation, and factors influencing people's perceptions of content accuracy.
Although misinformation is not a recent phenomenon [56], the fairly recent use of online platforms for its dissemination has gained misinformation fresh attention. Researchers have focused on defining the problem space of misinformation [77, 78] and determining how it has changed the societal and political arena [11, 62]. A body of work examines data collected from online communities to study how people use social media to seek and share information, and how misinformation spreads through these communities [16, 48]. For example, Starbird et al. investigate rumors that emerge during crisis events and report on the diffusion patterns of both false content and its corrections [70, 71]. Vosoughi et al. analyze the spread of rumor cascades on Twitter and find that, among fact-checked articles, false content diffuses farther, faster, deeper, and more broadly than the truth [75]. By analyzing tweets during and following the 2016 US election, Shao et al. find evidence that social bots played a role in spreading articles from low-credibility sources. In addition, they identify common strategies used by such bots, e.g., mentioning influential users [63]. Other work has examined echo chambers on social media and their selective exposure to information, e.g., in communities that relate to certain conspiracy theory narratives [16, 44, 59, 61].

Another strand of research tries to understand how people's sharing behavior on social platforms is impacted by different aspects of the content or the platform. For instance, Vosoughi et al. report that people are more likely to share information that is novel, a characteristic that false content usually has [75]. Pennycook et al. report that subtly priming people to be mindful of content accuracy before deciding whether to share the content helps lower their intention of sharing false information [52, 53]. They argue that this phenomenon happens because, although people are generally good at discerning accuracy, when deciding whether to share content they are less concerned with accuracy than with other aspects of sharing, such as the amount of social feedback they receive. Focusing their attention on accuracy can help alleviate the problem. This interpretation is consistent with prior work reporting that people who rely more on their intuition and engage in less critical thinking are more susceptible to believing political fake news in survey experiments [55] and in fact share news from lower-quality sources on Twitter [45].

We extend this body of work by studying the effects of asking people to provide accuracy assessments, as well as their reasoning for why a news story is or is not accurate, on their intentions of sharing the content, quantifying the degree to which these nudges reduce sharing of false (and true) content. If these nudges prove to be effective at curbing the sharing of inaccurate content, news sharing platforms and social media can benefit from implementing them.
Another body of work has investigated how to combat misinformation. For example, Bode and Vraga investigate the roles that authoritative and expert sources, peers of social media users, and platforms play in correcting misinformation [10, 76]. Other such studies investigate the impact of the wording of corrections [10, 39] and explore identifying misinformation in its early stages using previously known rumors [80], presenting linguistic and social network information about new social media accounts to help users differentiate between real and suspicious accounts [31], and increasing media literacy [28].
A related thread of work studies how social media users engage with fact-checking in the wild. Zubiaga et al. explore how rumors are diffused on social media and how users respond to them before and after the veracity of a rumor is resolved. They report that the types of tweets that are retweeted most are early tweets supporting a rumor that is still unverified. Once a rumor has been debunked, however, users do not make the same effort to let others know about its veracity [83]. In another study on rumors, Shin et al. analyze rumors spread on Twitter during the 2012 election and find that they mostly continued to propagate even when information from professional fact-checking organizations had been made available [64]. Shin and Thorson analyze how Twitter users engage with fact-checking information and find that such information is more likely to be retweeted if it is advantageous to the user's group (partisanship) [65]. Other work has studied circumstances under which fact-checking is effective and finds that social media users are more willing to accept corrections from friends than strangers [29, 38]. In addition, a Twitter field experiment found that being corrected by a stranger significantly reduced the quality of content users subsequently shared [43].

Social media platforms have also been exploring ways to ameliorate their misinformation problem. Facebook, for example, started showing red flags on certain posts to signal their lack of credibility. Such flags, however, encouraged people to click on the content, causing Facebook to remove the flags in favor of adding links to related articles underneath the posts [3]. Facebook has also reported that it lowers the ranking of groups and pages that spread misinformation about vaccination [1]. Platforms have recently turned to working with third-party fact-checkers to remove content that violates their community policies [4]. These measures in general force the platforms into a truth-arbitration role, which is especially problematic in cases where policies have not had the foresight to anticipate every kind of problematic post, or in grey areas [66, 68]. We are interested in exploring "friend-sourced" methods in which the platforms are only responsible for delivering the assessments and not for making them.

We study the effects of two behavioral nudges, requesting accuracy assessments and rationales, on the sharing of false news, as countermeasures that could be incorporated into social media platforms. To best leverage them, we also study how to capture people's rationales in structured form. We hypothesize that requiring users to assess the accuracy of news headlines at sharing time acts as a barrier to posting, reducing sharing of false content but also of true content to a lesser degree. In addition, we hypothesize that further asking users to provide rationales for their accuracy assessments will result in a greater reduction in sharing of false headlines, and potentially of true headlines, although to a lesser degree.
A body of work has investigated the characteristics of posts, or of people's interactions with them, that affect their perceived accuracy. For instance, the number of quoted sources [74], prior exposure to a claim [51], official-looking logos and domain names [79], and post topic and author username [42] have been found to impact perceptions of news credibility, whereas the news publisher has surprisingly little effect on news accuracy ratings [17]. Pennycook et al. find that attaching warning flags to a subset of news stories increases the perceived credibility of those without flags [50]. By conducting interviews with participants whose newsfeeds were manipulated to contain false posts, Geeng et al. study why people do not investigate content credibility, e.g., because they have undergone political burnout [22].

Our study builds upon prior work by investigating self-reported reasons why people believe or disbelieve claims. We develop a taxonomy from these reasons and revise it iteratively until people untrained in the taxonomy are able to use it reliably to label their own rationales. We hypothesize that by leveraging these structured rationales and deliberating on all the dimensions
of accuracy, users will be more discerning of false vs. true content compared to when they provide unguided free-form reasoning. In addition, structured reasons have the added benefit that they could easily be integrated into platforms as signals of content credibility.
In this section we introduce terminology that we will use to discuss and evaluate our interventions, some of which is also used in the course of the studies.
Our overall goal is to study interventions at the moment of sharing content online. To evaluate these interventions, we seek a meaningful performance metric. Interventions that occur only when a user has decided to share can only prevent some sharing, thus reducing the amount of sharing overall. An intervention that might be considered ideal would not prevent sharing of true content but would prevent all sharing of false content. More generally, it is useful to separately consider the degree to which an intervention reduces sharing of true content and the degree to which it reduces sharing of false content. Previous work [50, 52, 53] often assessed the performance of an intervention by comparing the change in the absolute difference between the rates at which true and false content was shared. But here, we argue that an intervention which results in no change in the difference in sharing rates can still be highly beneficial if it changes the ratio of sharing rates.

Consider for example a user who shared 20% of the true content and 10% of the false content they encountered. If the "input stream" of content they read were balanced between true and false, then they would share twice as much true content as false, meaning the "output stream" of content they shared would be 2/3 true to 1/3 false. Now suppose that an intervention decreases their sharing rate on both true and false content by 5 percentage points, to 15% and 5% respectively. There is no change in the absolute difference in sharing rates, but the user's new output stream consists of 3/4 true content and 1/4 false. Going farther, if both rates are decreased by 10 percentage points, the output stream will contain only true content.

Therefore, in assessing the effect of interventions, we focus on the (change in the) ratio of sharing rates rather than the difference. If a user shares a fraction 𝑓 of false posts and a fraction 𝑡 of true posts, then an input stream with a ratio 𝑟 of false posts to true posts will yield an output stream with a ratio 𝑓𝑟/𝑡 of false to true. Thus, an intervention that reduces the discernment ratio 𝑓/𝑡 will improve the output false-true ratio regardless of the change in the difference of sharing rates. Of course this comes at a cost: a reduction in the overall sharing of true content. Different platforms may have different cost-benefit analyses of this trade-off. Note also that a portion of the benefit can be acquired at a portion of the cost by invoking the intervention on only a certain fraction of the shares.

On many social platforms, the content one user consumes and shares is content that has been shared by others. If each user's assessments improve the false-true ratio by a factor 𝑓/𝑡, then over a chain of 𝑘 sharing steps the ratio is improved by a factor of (𝑓/𝑡)^𝑘 overall; so even minor improvements accumulate exponentially.
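To make the arithmetic concrete, the short R snippet below reproduces the worked example above and the compounding over resharing chains. This is a minimal illustrative sketch that we add here; it is not code from the study, and the names (output_ratio, share_true, share_false) are ours.

```r
# Sharing rates from the worked example: the user shares a fraction t of the
# true posts and a fraction f of the false posts that they encounter.
share_true  <- 0.20  # t
share_false <- 0.10  # f

# For an input stream with ratio r of false to true posts, the output stream
# has a false:true ratio of f*r/t.
output_ratio <- function(f, t, r = 1) f * r / t

output_ratio(share_false, share_true)                # 0.5  -> output is 1/3 false
output_ratio(share_false - 0.05, share_true - 0.05)  # 1/3  -> output is 1/4 false
output_ratio(share_false - 0.10, share_true - 0.10)  # 0    -> output is all true

# Over a chain of k sharing steps, the discernment ratio compounds as (f/t)^k.
k <- 3
(share_false / share_true)^k                         # 0.125: small gains accumulate
```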
We now introduce relevant terminology. Previous work has shown that personal deliberation can improve people's ability to distinguish accurate from inaccurate information [82]. Our interventions seek to engender different amounts of that deliberation. Initially, it is possible that users might choose to share certain information without even considering whether it is accurate. Our minimal intervention, simply asking a user whether a claim is accurate, already forces users to deliberate at least enough to answer the question. We expect that spending more time and effort in deliberation will generally improve a user's accuracy assessments. However, there is a limit to this improvement, based on a user's available knowledge (and inaccurate knowledge) and rationality, that will prevent their ever assessing perfectly. We use veracity to refer to whether the claim is accurate or not, independent of the user's knowledge. We use perceived accuracy to refer to the user's initial answer to whether they consider the claim accurate. We expect that subjective assessment of accuracy will correlate with objective accuracy, but imperfectly. Finally, we define the average perceived accuracy of a claim to be the fraction of users who perceive the claim as accurate.
The objective of our first study (the Taxonomy study) was to develop a taxonomy of self-reported reasons why people believe or disbelieve news claims. In the second (the Nudge study), we investigated whether asking people to provide accuracy assessments and reasoning for the (in)accuracy of a news claim before they share it on social media nudges them to be more mindful of its credibility, and whether this nudge affects their sharing behavior. We further examined the effects of different instruments (a free-format text box or a structured set of checkboxes) for capturing reasons on the sharing of news stories that are not credible. Our study was approved by our Institutional Review Board.
We collected most of the claims we used in our studies from Snopes (https://snopes.com), with a few from mainstream media. Each claim was presented with a picture, a lede sentence, and a source that had originally published an article on the claim, similar to news stories on social media (see Figure 1), and also because users do not generally read entire articles, mainly limiting their attention to headlines [37]. The claims varied along different dimensions of veracity, domain, partisanship, and original source.

For the claims that we collected from Snopes, we had the ground truth that the website's fact-checkers had provided. We fact-checked the few that we collected from mainstream media by investigating the sources to which they referred. For domain, we chose claims that were either political or about science and technology, with claims in both domains covering a wide range of issues. Some of the claims were pro-Republican, some pro-Democratic, and others had no clear partisanship. The claims came from various sources including mainstream media, conspiracy websites, and social media. For the claims that had originally been circulated on social media such as Facebook, we displayed the source as "Posted via Facebook.com".

Because we intended for our political claims to be relevant at the time of the study and not outdated, for each iteration of the study we collected new political claims that had emerged or re-emerged within the few months prior to that iteration. Selecting relevant headlines resulted in the iterations of the Taxonomy study having different but overlapping sets of claims, which supported our goal of a generalizable taxonomy. Another set of claims was used for the Nudge study, which was conducted in one iteration. These claims had a large overlap with those used in the last iteration of the Taxonomy study. We have provided this set in Appendix F.

In addition to relevance, we considered provenance when selecting claims for the study. We chose claims for which Snopes had provided the originating source, or ones that it explicitly mentioned as being widespread rumors. Some claims, for example, had been submitted for fact-checking by Snopes's readership and therefore did not have a clear source or place of emergence; we did not select these claims. In addition, we filtered out claims that were not factual (e.g., satire), because including them would have required presenting the item at the article level rather than the headline level.

Fig. 1. An example of a headline in the study. Headlines were shown with an image, a lede sentence, and a source.

Fig. 2. The survey interface for the Nudge study iteration 4, where for each reason that a participant selected, they were required to fill in a text box explaining their rationale.
We recruited US-based participants from Amazon Mechanical Turk. Across the 4 iterations of the Taxonomy study, 317 participants provided at least one (non-spam) answer to the news items presented to them. Of those, 305 completed the full survey and provided demographic information. The number of (non-spam) participants for the Nudge study was 1668, of whom 1502 provided demographic information. Of the 1807 participants across both studies who provided demographic information, 42% were female. 47% identified as Democratic, 25% as Republican, and 26% as Independent. They were distributed across a wide age range with a median of 35 years. The median income was $40,000 to $49,999.

The payment for the Taxonomy study was $3. We determined the pay for the Nudge study ($3.75) based on a soft launch of the HIT with 100 workers, which had a median completion time of 25 minutes. The soft launch also revealed a high spam rate, leading us to limit the HIT to workers with a past HIT approval rating higher than 95% across all requesters.
We developed a taxonomy for people with no prior training to label their own rationales for why they (dis)believed a claim. We therefore assigned a descriptive label to each category from a first-person point of view (e.g., "The claim is not consistent with my past experience and observations."). A goal was for this description to be one that subjects could use correctly—that is, that participants would generally agree with each other, and with us, about the meaning of a particular category. We used multiple iterations of our study in order to achieve this goal, as described below.
Through an online survey, participants were shown a series of 10 claims one at a time, with the claims randomly chosen from a pool. For each claim, the survey asked whether the claim was accurate or inaccurate, why, and how confident the participant was in their belief (4-point scale). Participants then answered another survey gauging their critical thinking [21], statistical numeracy and risk literacy [14], political knowledge, and attitudes towards science. These questions were drawn from research studying users' judgements assessing claims [55]. Finally, participants answered demographics questions on political preference and theistic ideologies, among others.
We performed this study in 2 stages. To develop a preliminary taxonomy, we ran a first stage in which participants provided rationales for believing or disbelieving each of the claims via free-text responses. A total of 50 participants completed this stage of the study. We first divided the free-form responses that we collected into idea units, each being a coherent unit of thought [73]. This resulted in 534 idea units. A member of the research team then conducted a first pass over the idea units and assigned preliminary categories to each using a grounded theory approach [13].

In the second stage of the study, we worked iteratively to refine our taxonomy, consolidating categories showing too much overlap and splitting some that represented distinct ideas. A particular goal was for the categories and their labels to align with how participants labeled their own responses. In this stage, for each claim, participants were asked to mark checkboxes corresponding to the reasoning categories they used and then to provide elaboration in text. To measure the alignment between our intended use of the categories and participants' perception of them, a member of the research team with no knowledge of the users' checked categories assigned categories to the elaborated reasons. We then measured Cohen's Kappa as a measure of the agreement between the categories selected by the research team coder and the ones participants had selected for their own responses. We conducted 3 rounds of this study, each time iterating over the categories.

The Kappa score in our initial iterations of the study was low, which led us to revise the reason categories. However, we discovered that the low score was partly an artifact of the study design. In the initial iterations, participants were asked to first mark their reasons using the checkboxes and then to provide an aggregate explanation in one text box. We noticed that the explanations did not cover all the selected reasons, possibly because participants did not feel impelled to provide comprehensive explanations or because they deemed some checkboxes self-explanatory. We addressed this issue by modifying the survey interface so that for each checkbox that a participant selected, they were required to fill in a text box explaining their reasoning (see Figure 2).

Our attained agreement score in the 3rd iteration of this stage of the study was 0.63, which we measured across 729 idea units collected for 48 news claims. The score exceeded the recommended threshold for accepting the results [36], although other scholars suggest higher thresholds for various tasks. While our attained agreement score may be lower than ideal, we deem it sufficient for this type of task. The 4 iterations of the Taxonomy study spanned 13 months.
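For reference, paired agreement of this kind can be computed with standard tools. The sketch below is our own illustration with invented labels (it is not the study's data or analysis code) and assumes the kappa2 function from the R package "irr":

```r
# Minimal sketch: Cohen's Kappa between participants' self-selected categories
# and a researcher's coding of the same idea units (labels invented here).
library(irr)

codes <- data.frame(
  participant = c("Plausible", "Don't know", "Implausible", "Plausible", "Don't know"),
  researcher  = c("Plausible", "Don't know", "Implausible", "Don't know", "Don't know")
)
kappa2(codes)  # unweighted Cohen's Kappa over the paired labels
```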
Some categories emerging from the study would definitively determine that a claim was or was not accurate from the evaluator's perspective. We signalled the strength of these categories by grouping them under the names Accurate on the evidence and Inaccurate by contrary knowledge. Some rationales, on the other hand, were not conclusive but rendered a claim (im)plausible. For these, we used the terms Plausible and Implausible to indicate the strength. Other rationales were surmises and speculations. Although these rationales were not informative, we nonetheless wanted people to have the means to provide such rationales while indicating their lack of strength. We grouped these under the term Don't know.

We determined in the earlier iterations that the Don't know categories were self-explanatory and the elaborations often repeated the terms. We therefore did not ask for elaborations when participants selected these categories in the final iteration.

Two categories emerged in the taxonomy that were used by participants in the initial stages as reasons a claim was inaccurate, but that we concluded were not reliable for deducing accuracy and rather belonged to other dimensions of credibility. One of these categories, The claim is misleading, could be used for claims that, for instance, are accurate if taken literally but are intentionally worded in such a way as to lead the reader to make inaccurate deductions. Similarly, the category The claim is not from a trusted source could be applicable to claims that are in fact accurate, since sources of
unknown reputation, and even malicious ones, publish accurate information mixed with false content or propaganda [69]. Therefore, we separated these two categories from the rest and requested that participants evaluate each claim on these two signals regardless of whether they judged the claim as accurate. Tables 1, 2, and 3 show the full taxonomy.
Here we elaborate on some of the categories that were not self-explanatory. Throughout the paper, where we present participants' free-text responses, we identify them with a string of the form "p-" + a participant number to preserve their anonymity. If a participant is from an iteration other than the final one, we have concatenated the following string to the end of their identifier: "-" + iteration number.
One of the most cited rationales for a claim's (in)accuracy was that it was (not) consistent with the participant's past experience and observations. This assessment at times resulted from the participant's general knowledge of how an authority, e.g., the law, operates: "I am pretty sure that to sign up for these benefits you would need a social security number." (p-9) — Claim: Seniors on Social Security Have to Pay for Medicare While 'Illegal Immigrants' Get It Free.

At other times, the rationale was based on whether the assertion in the claim matched the subject of the claim's past profile or pattern of behavior: "Joe Biden has a history of gaffes that come off this silly." (p-79) — Claim: Joe Biden Said: 'Poor Kids Are Just as Talented as White Kids'.

Sometimes the assessment referred to whether the claim confirmed or contradicted the customary state of the world as the participant perceived it: "It has been my experience that attitudes on sexual orientation have changed considerably" (p-53) — Claim: Age Matters More than Sexual Orientation to U.S. Presidential Voters, Poll Finds. This rationale also emerged in cases where the participant had heard about similar claims before, and although the claim's accuracy had not been established, the repeated encounter made it seem plausible: "I've been hearing this statement since I was a child, snowflakes are like fingerprints. There are none that are identical." (p-32-3) — Claim: No Two Snowflakes Are Exactly Alike.

This phenomenon has also been reported in [51], where Pennycook et al. found that even a single prior exposure to a headline increases subsequent perceptions of its accuracy. Surprisingly, the illusory truth effect of repeated false statements influences people across the political spectrum, i.e., even if their ideological beliefs disagree with the statements [47]. In fact, in the earlier iterations of the taxonomy, Familiarity of Claim was a separate category from Consistency with Past Experiences and Observations. However, we merged the two because in many instances participants could not make the distinction.
Participants reported how a claim looks as a factor impacting their perception of the claim's accuracy. They referred to sensational language—"It's too much like a sensationalist tabloid headline." (p-10), grammatical errors—"I really have no idea if this is true or not but the language seems weird." (p-46), clickbait titles—"The title seems to be clickbait" (p-27), and the quality of presented artifacts—"The image looks like a generic image that is commonly used with clickbait. The website referenced is "cityscrollz.com" which has a stylized spelling and does not seem to be a reliable news source." (p-65) as indicators of an article's falsehood. Interestingly, some of these factors are among the set of indicators for evaluating content credibility suggested by Zhang et al. [82]. While Zhang et al.'s proposed indicators are intended for evaluating both accurate and inaccurate content, in our study this type of rationale was cited only as an argument for refuting a claim. One explanation is that because we only showed headlines and not full articles, participants may have interpreted the presence of these factors as red flags invalidating the claim, but their absence as simply calling for more investigation.
Table 1. Taxonomy of reasons why people believe news claims.
Accurate on the evidence
• I have a high degree of knowledge on this topic that allows me to assess this claim myself (e.g., I teach/write about this topic or I use this in my work). (N=2)
  Example: "Global warming is really happening around us and we must stop it. I have researched it for some time now." (p-3) — Claim: Global Sea Ice is at a Record Breaking Low.
• I have firsthand knowledge of the subject or am an eyewitness. (N=23)
  Example: "My own dog does this when other dogs come into the picture, I can see her getting jealous for my attention." (p-60) — Claim: A Study Showed that Dogs Exhibit Jealousy.
• My other trusted sources (besides the source of this article) confirm the entire claim. (N=54)
  Example: "I've read numerous articles from Huffington Post, Buzzfeed, and Mashable about this happening." (p-26) — Claim: Some Phone Cameras Inadvertently Opened While Users Scrolled Facebook App.
• The claim is from a source I trust. (N=49)
  Example: "I do trust the Washington Post to report accurately." (p-47) — Claim: Gun Violence Killed More People in U.S. in 9 Weeks than U.S. Combatants Died in D-Day [source: WashingtonPost.com].
• Evidence presented in the article corroborates the claim. (N=9)
  Example: "The mountain peaks in the background look the same. I think the claim is very likely to be true." (p-23) — Claim: These Photographs Show the Same Spot in Arctic 100 Years Apart. [The claim is presented with two juxtaposed photos, one showing mountains covered by glaciers, and in the other the glaciers have almost completely melted.]

Plausible
• The claim is consistent with my past experience and observations. (N=120)
  Example: "This seems fairly consistent. The media seems to only report when Trump does wrong, even from the start." (p-71) — Claim: President Trump's Awarding of a Purple Heart to a Wounded Vet Went Unreported by News Media.

Don't know
• I'm not sure, but I want the claim to be true. (N=32)
  Example: "It is an interesting claim so I would hope that it was true but I've never heard of YUP and Twitter isn't very reliable." (p-46-3) — Claim: There is a Point in the Ocean Where the Closest Human Could Be an Astronaut. [The picture presented with the claim shows a tweet explaining the claim from the source YUP.]
• I was just guessing. (N=104)
  Example: "I have no knowledge of the headlines, but it seems plausible that it could be true based off just a guess." (p-38-3) — Claim: There Are More Trees on Earth Than Stars in the Milky Way.
• Other (N=14)
  Example: "It's probably not the whole story, and is probably connected to the lack of federal recognition of same sex marriages. As in because the marriages aren't recognized, the adoption of foreign national children by those couple as a couple aren't [sic] recognized, so the child can't be naturalized, etc." (p-37) — Claim: The Trump Administration Is Denying U.S. Citizenship to Children of Same-Sex Couples.
Sometimes participants determined that a claim was false because it seemed to advance a particular agenda. In most cases, they did not know about the accuracy of the particular claim, but based their assessment on their prior familiarity with the general topic or the source of the claim: "Fox news is biased and probably doesn't like whatever the 'new way forward' is and wants to focus on the negatives." (p-45) — Claim: The New Way Forward Act Would Protect Criminals from Deportation [source: FoxNews.com]. This type of rationale generally surfaced for partisan issues and claims that were associated with particular groups and movements: "The source has a liberal bias, and likely is including suicides as 'violence' which most will interpret as person on person violence." (p-24) — Claim: Gun Violence Killed More People in U.S. in 9 Weeks Than U.S. Combatants Died in D-Day [source: WashingtonPost.com].
Table 2. Taxonomy of reasons why people disbelieve news claims.
Inaccurate by contrary knowledge
• I have a high degree of knowledge on this topic that allows me to assess this claim myself (e.g., I teach/write about this topic or I use this in my work). (N=3)
  Example: "Asylum seekers can only apply on US soil. I've worked with immigrants abroad for an extended period of time." (p-2) — Claim: Asylum-Seekers Can Apply at U.S. Embassies Abroad.
• I have firsthand knowledge on the subject or am an eyewitness. (N=7)
  Example: "I watched the news coverage." (p-67) — Claim: ABC, CBS, and NBC Blacked Out Pam Bondi's Legal Defense of Trump during His Impeachment Trial.
• The claim contradicts some information related to the case that I know from trusted sources. (N=49)
  Example: "I think that number is almost equal to the total number of homicides in the US which is a ridiculous notion." (p-80) — Claim: 10,150 Americans Were Killed by Illegal Immigrants in 2018.

Implausible
• The claim is not consistent with my past experience and observations. (N=46)
  Example: "The man is a showman, there's no way he'd do something like this without letting anyone know about it." (p-30) — Claim: President Trump's Awarding of a Purple Heart to a Wounded Vet Went Unreported by News Media.
• If this were true, I would have heard about it. (N=91)
  Example: "I feel [something] like this would have been a huge story that would have been carried on many national news networks." (p-39) — Claim: US Intelligence Eliminated a Requirement That Whistleblowers Provide Firsthand Knowledge.
• The claim appears to be inaccurate based on its presentation (its language, flawed logic, etc.). (N=14)
  Example: "This tweet follows the standard "ah! everybody panic!" format you see for unsubstantiated information. Also, there is no link to a source." (p-79) — Claim: Presidential Alerts Give the Government Total Access to Your Phone. [The accompanying picture shows a tweet by John McAfee warning about the issue.]
• The claim appears motivated or biased. (N=90)
  Example: "Not saying which library or for what reason leads people to come up with their own conclusions." (p-30) — Claim: Biden's Campaign Demanded an American Flag Be Removed from a Library.
• The claim references something that is impossible to prove. (N=12)
  Example: "'Most sane of all' is a hard metric to measure." (p-29) — Claim: A Scientific Study Proved that "Conspiracists" Are "The Most Sane of All".

Don't know
• I'm not sure, but I do not want the claim to be true. (N=99)
  Example: "Knowing that I don't want to believe but aware of Biden's inaccurate pronouncements, I just chose not to give this any credence because, it is irrelevant." (p-37-3) — Claim: Joe Biden said the Mass Shootings in El Paso and Dayton Happened in 'Houston' and 'Michigan'.
• I was just guessing. (N=74)
  Example: "I am purely guessing here because I don't know. I'd rather not believe something that is true than believe something that is actually false." (p-36-3) — Claim: Mueller Concluded Trump Committed 'No Obstruction' in the 2016 Election Probe.
• Other (N=61)
  Example: "If migrants could leave at any time, why wouldn't they leave. What would be the point in detaining them and sending them to a center if they were free to do as they please in the first place?" (p-9) — Claim: Migrants 'Are Free to Leave Detention Centers Any Time'.
The taxonomy categories underwent multiple iterations of transformation based on the misalignment between our expected use of the categories and how participants used them. Here we elaborate on some examples of this misalignment as potential areas for future refinement.
This category was intended for cases where a participant had heard about the claim from other sources or had learned about it from an authority (e.g., at school). Sometimes, participants believed that they had heard about the claim before but did not fully remember the particulars of the claim or from whom they had heard it, and that having encountered it nonetheless made them more accepting of its plausibility. In these cases, they often used the category The claim is consistent with my experience and observations, as seen in the following example: "It seems like something I have read/heard before in the past." (p-16-3) — Claim: U.S. Sen. Lindsey Graham Once Said a Crime Isn't Required for Impeachment. Therefore, it appears that the degree of confidence in one's rationale, in addition to the type of the rationale, can shift the assigned label from one of these categories to the other.
Table 3. Other signals of credibility that do not necessarily render a claim accurate or inaccurate.

• The claim is misleading. (𝑁 Accurate = 𝑁 Inaccurate = …)
  Example: "It might be that women end up with DNA-containing fluid/skin cells during sex, but not permanently. Or during pregnancy some of DNA from the fetus (which would contain some of the male partner DNA) might end up in the woman's blood. But not every partner's." (p-37) — Claim: Women Retain DNA From Every Man They Have Ever Slept With.
• The claim is not from a source I trust. (N=4)
  Example: "NBC has a habit of exaggerating to obtain ratings." (p-88) — Claim: Watchdog: ICE Doesn't Know How Many Veterans It Has Deported [source: NBCNews.com].
In the earlier iterations of the taxonomy, this category was referred to as I Have Specific Expertise on the Subject. However, we discovered that the definition of expertise varied across participants, as demonstrated by this example that was labeled by the participant as belonging to the expertise category: "I have seen this nearly exact headline on social media, and was curious about its claims. Turns out that, from what I read on the government site, that this headline is misleading." (p-28-3) In the subsequent iterations, in addition to refining the name of the category, we added the example I teach/write about this topic or I use this in my work to better convey the intended bar for expertise.
Participants occasionally said they had investigated the sources of a claim, or had heard another source verify or refute the claim, and thus reported having firsthand knowledge: "I saw it from reliable sources, I witnessed the news articles myself firsthand." (p-67) However, our intended category for such occasions would be My other sources confirm the entire claim if the claim was considered accurate, and The claim contradicts some information related to the case that I know from trusted sources if considered inaccurate.
We investigated whether the demographics of participants have an effect on the types of rationales that they produce. We found that the distribution of rationales differs statistically across genders; we present details in Appendix Section A.
Some categories in Table 3 may appear similar to other rationales that can render a claim implausible, and need to be further distinguished. One such pair is a claim being misleading and its appearing to be inaccurate based on its presentation (e.g., use of sensational language). In the absence of other information, sensational language renders the claim implausible, i.e., it fails to convince the user that the claim is accurate. The claim being misleading, however, is not a reason why the claim should be accurate or inaccurate, but exists in addition to the accuracy dimension. Users could consult other rationales to determine the accuracy or plausibility of the claim, and determine that, for instance, although the claim is accurate, the accurate pieces of information are chosen and the article crafted in such a way as to imply a particular inaccurate message. Another such pair is the claim appearing
motivated or biased and its source being one that the user does not trust. To determine whether a claim is motivated or biased, users often resort to such information as their familiarity with the subject matter or their prior knowledge of the agenda of the source, as well as the inferred bias of the claim, to determine whether the truth has been twisted in the claim. Therefore, when the message of the claim agrees with the bias of the source, users see it as an indication that the claim may in fact not be accurate. A separate dimension of credibility is whether the user trusts the source. For instance, a user who has stated they do not trust Fox News may in fact assess Fox's claim that Biden won the presidential election in Arizona [72] as accurate, and the claim as not biased, while maintaining that the claim is from a source they do not trust.

We now discuss some general issues that arose in this study.

There is an important logical asymmetry between being consistent and inconsistent with past experience. Inconsistency with the past does offer some level of indication that the claim is inaccurate. However, given the tremendous number of things that could happen consistent with the past, consistency offers little evidence that a claim is accurate—instead, it only fails to provide evidence that the claim is not accurate. In general, many participants seemed not to make this distinction, using consistency with the past as sufficient evidence that a claim is true. Because participants used this category often, system designers may feel compelled to make it available. But those designers might want to consider treating this category as indicating that the user does not know whether the claim is accurate, rather than as indicating accuracy.

The confusion of lack of refutation with accuracy offers opportunities for manipulation. It suggests that a subject might tend to believe that a politician voted yes, and equally that a politician voted no, simply because they saw one or the other headline, without any other evidence.

In a similar vein, some subjects treated
The claim is not from a source I trust as a reason to consider a claim false. Related work shows that alternative media sources borrow content from other sources, including mainstream media [69]. Therefore, it is important to bring users to the realization that sources of unknown or low reputation may in fact publish accurate content. While we observed that users can fairly reliably use the taxonomy in its current format to characterize their rationales, the taxonomy can still benefit from further refinement, which we leave to future work.
We hypothesized that asking people to reflect on the accuracy of a news story before they share it on social media would help prevent sharing of news stories that are not credible. We additionally hypothesized that getting people to consider their rationales would aid their deliberation. For these purposes, we used the taxonomy that we developed in the Taxonomy study as one option to nudge people to consider possible rationales, along with a second free-text option.
Table 4 summarizes our experimental conditions. As in the Taxonomy study, participants were shown a series of 10 claims one at a time via an online survey. Headlines were randomly drawn from a set of 54 headlines. Of the pool of headlines, 24 were true, 25 were false, and 5 were assessed as being a mixture of true and false. If a participant was in any of the treatment conditions, for each claim the survey would ask whether the claim was accurate or inaccurate and how confident the participant was in their belief (4-point scale), displayed in Figure 3. If the participant was in one of the reasoning conditions, the survey would additionally ask why they believed the claim was (in)accurate. At the end of each item, all participants were asked if they would consider sharing the article on social media, with options Yes, Maybe, and No, displayed in Figure 4. We also followed up with another question asking why they would (not) consider sharing the article. Following the claims, participants answered a survey asking how they would verify the accuracy of a headline like what they saw, and how comfortable they were with asserting their judgments publicly. Then they answered partisanship questions for each claim that they had previously seen: "Assuming the above headline is entirely accurate, how favorable would it be to Democrats versus Republicans?" (5-point scale). The survey contained post-task questions similar to those of the Taxonomy study. The full questionnaire is included in the Supplementary Materials. The Nudge study was completed in 12 days.
Fig. 3. The UI for how accuracy and confidence questions were presented to a participant, along with an example of a headline.

Fig. 4. The UI for how we asked users whether they would consider sharing the headline presented to them.

Table 4. Experimental conditions of the Nudge study. The table shows which conditions presented participants with questions to assess the accuracy of claims and to provide their reasoning for why the claim is (not) accurate, and whether we used the taxonomy we developed in the Taxonomy study to capture their reasoning.

• Cond. 1: no accuracy assessment, no reasoning.
• Cond. 2: accuracy assessment, no reasoning.
• Cond. 3: accuracy assessment, reasoning in free-form text.
• Cond. 4: accuracy assessment, reasoning via a checkbox of taxonomy categories + text.
Our dataset from the Nudge study contained 21,113 datapoints, of which we identified 5,403 as spam by investigating their associated free-text responses. The responses that we labeled as spam were copy-pastes of the headline title, sometimes with minor modifications, or unrelated answers to the question (e.g., responding "good" to all the questions). Other responses that we labeled as spam had glaring grammatical errors and were mismatched with the participant's accuracy assessment. In these cases, we deemed that the participant had not received the intended treatment and therefore treated their response as spam. Exclusions were applied at the participant level: whenever we determined that a datapoint was suspected spam, we examined all the other datapoints submitted by the same participant, as well as their responses to the survey that followed the claims, described in Section 6.1. In almost all cases, the qualities that disqualified a datapoint were present in all of that participant's responses, and we therefore labeled all of the participant's submitted datapoints as spam.
One example of a datapoint that we discarded as spam: "the claim will be dangerous." in response to Why do you think the claim is inaccurate?; "it gives the fear about the treatment." in response to Why would you not consider sharing it?; both from the same datapoint; Claim: Tibetan Monks Can Raise Body Temperature With Their Minds.
We also excluded 6 datapoints where participants had technical issues. The datapoints that we included in the analyses were collected from 1,668 participants. Participants did not always complete all 10 claims due to dropout or technical issues; we also removed from our analyses the datapoints participants provided for headlines whose ground truth was neither completely true nor false (mixture), which were collected for exploratory purposes. Of the datapoints that we included in the analyses, 3,740 were in condition 1, 3,977 in condition 2, 3,405 in condition 3, and 3,118 in condition 4.
To test the effect of the nudges and their interaction with the objective or subjective veracity of headlines, we fit two types of models to our dataset, both with share intention as the dependent variable. One was a linear mixed effect model, for which we assigned the values 0, 0.5, and 1 to the share decisions "No", "Maybe", and "Yes" respectively. The other was a cumulative link mixed model, which treated the share decisions as ordinal. Results were consistent between the two models. Because the linear model is more straightforward to interpret, we discuss its results below and leave the results of the cumulative link model to Appendix C.
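As a concrete illustration, the following minimal R sketch shows one way this coding could be done; the data frame "d" and its column names are hypothetical stand-ins, not the authors' actual analysis code.

    # Toy data frame standing in for the study data (illustrative only)
    d <- data.frame(share = c("No", "Maybe", "Yes", "Maybe"))

    # Numeric coding for the linear mixed effect model: No = 0, Maybe = 0.5, Yes = 1
    d$share_num <- c(No = 0, Maybe = 0.5, Yes = 1)[d$share]

    # Ordered-factor coding for the cumulative link mixed model
    d$share_ord <- factor(d$share, levels = c("No", "Maybe", "Yes"), ordered = TRUE)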
To test the effect of nudges and their interaction with the objective veracity of headlines, we developed a linear mixed effect model with sharing intention as the dependent variable and our study treatments as independent variables. The treatments were whether participants were asked about accuracy, whether they were asked about reasoning, and whether the reason checklist was presented to them. The model also included the veracity of the headline and the interaction between veracity and each of the treatments as independent variables. We fit this model to the whole dataset. We included the participant identifier and the headline the participant had assessed as random effects in our models. The inclusion of these random effects accounts for the non-independence between the datapoints provided by one participant or for one headline, and captures the variance in the intercepts between participants or headlines. We used the function "lmer" from the R package "lme4" to define the model. We refer to this model as the veracity model:

share ∼ veracity × (accuracy condition + reasoning condition + reasoning format) + (1 | participant) + (1 | claim)    (1)

We also developed another, more refined veracity model with the demographics of participants as control variables, discussed in Appendix B. The effects we observed for headline veracity as well as our treatments in the model discussed in the Appendix were consistent with the results we observed for the model outlined in this section.

Although the ultimate desired sharing behavior is for people to share content that is in fact true and refrain from sharing objectively false stories, the best achievable outcome for behavioral nudges that encourage deliberation is to guide people to better discernment based on what they already know. One realization, for example, could be that maybe, after all, they do not know what they previously had taken for granted. Therefore, in addition to examining how the nudges affect sharing of objectively true and false content, we investigate how they interact with headlines that participants had initially believed to be true or false, as indicated by their accuracy assessments.

Therefore, to test how the treatments and their interaction with a participant's initial accuracy assessment affect sharing intentions, we fit a model similar to the one described in 6.2.1.1, but included perceived accuracy, i.e., the participant's assessment of the accuracy of the headline, rather than veracity as the independent variable. Because we did not have accuracy assessments from participants in the control (condition 1), we fit this model to the data from conditions 2, 3, and 4. The treatments that we included as independent variables were whether participants were asked about reasoning and whether we presented the reason checklist to them. We refer to this model as the perceived accuracy model:

share ∼ perceived accuracy × (reasoning condition + reasoning format) + (1 | participant) + (1 | claim)    (2)

We performed a Wald Chi-Square test on each of the fitted models to determine whether our explanatory variables were significant. The tests revealed that the effect of veracity in the veracity model and the effect of perceived accuracy in the perceived accuracy model were both significant [χ²(…) = …, p < .001 for veracity; χ²(…) = …, p < .001 for perceived accuracy]. Post-hoc Estimated Marginal Means tests revealed that participants were more likely to share an objectively true rather than a false headline [z = …, p < .…; z = …, p < .…]. We observed that providing an accuracy assessment had a significant effect on sharing intentions [χ²(…) = …, p < .…; χ²(…) = …, p = .…]. We saw that whether participants provided reasoning had a significant effect on sharing intentions in both the veracity and perceived accuracy models [χ²(…) = …, p = .004 for the veracity model; χ²(…) = …, p = .001 for the perceived accuracy model].
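For readers wishing to reproduce this style of analysis, the sketch below shows how the two models and the accompanying tests can be specified with "lmer". It is a minimal sketch under assumed, hypothetical column names ("share_num", "veracity", "accuracy_cond", "reasoning_cond", "reasoning_format", "perceived_accuracy", "condition"), not the authors' actual analysis code.

    library(lme4)     # lmer() for linear mixed effect models
    library(car)      # Anova() for Wald chi-square tests
    library(emmeans)  # post-hoc Estimated Marginal Means tests

    # Veracity model (Eq. 1), fit to the whole dataset
    m_veracity <- lmer(
      share_num ~ veracity * (accuracy_cond + reasoning_cond + reasoning_format) +
        (1 | participant) + (1 | claim),
      data = d
    )

    # Perceived accuracy model (Eq. 2), fit only to the data from
    # conditions 2-4, where accuracy assessments were collected
    m_perceived <- lmer(
      share_num ~ perceived_accuracy * (reasoning_cond + reasoning_format) +
        (1 | participant) + (1 | claim),
      data = subset(d, condition != 1)
    )

    # Wald chi-square tests of the explanatory variables
    Anova(m_veracity)
    Anova(m_perceived)

    # Post-hoc contrast: sharing of objectively true vs. false headlines
    # (assumes veracity is coded as a factor)
    emmeans(m_veracity, pairwise ~ veracity)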
Fig. 5. Share rate of true and false headlines across study conditions. The results suggest that people are less likely to share both accurate and inaccurate content if they are asked to assess the content's accuracy (condition 1 vs 2). We observe similar trends if they provide their reasoning in addition to assessing accuracy, compared with if they only assess accuracy (condition 2 vs 3). These interventions, however, lower sharing of false content to a greater degree. The means of sharing true and false content also both decrease when people are asked to provide checkbox reasoning in addition to free-text (condition 3 vs 4). The ratio of shared false to true content, however, does not change.
Figure 5 shows sharing likelihood for condition 3, where we requested participants' rationale for why they believed a claim was or was not accurate, and condition 2, where we did not, across both true and false headlines. Similar to the results we observed for requesting accuracy assessments, requesting reasoning reduces sharing of false headlines to a greater degree (27%) than the decrease in sharing of true headlines (14%). Therefore, the ratio of shared false headlines to shared true headlines is reduced.

Figure 6 shows that requesting reasoning resulted in less sharing of headlines that participants initially believed to be true, but did not have an impact on sharing of perceived false content. The lack of reduction in sharing of subjectively false headlines, however, could be because their sharing rate was very low to begin with, leaving little room for improvement (6% in condition 2, 5% in condition 3). As expected, because the sharing of subjectively false headlines did not change to a great extent while the sharing of headlines initially perceived as true decreased, the interaction between reasoning and perceived accuracy was significant in the perceived accuracy model [χ²(…) = …, p < .…; χ²(…) = …, p = .…].

The effect of reasoning format on sharing was significant in both the veracity and perceived accuracy models [χ²(…) = …, p = .03 for the veracity model; χ²(…) = …, p = .03 for the perceived accuracy model]. Figure 5 shows that the mean sharing likelihood in the checkbox condition decreased by 17% for false headlines and 18% for true headlines, compared to the free-text condition.
Fig. 6. Share rate of headlines by their perceived accuracy, regardless of actual veracity. Participants are less likely to share content they initially perceived as true when they are asked about reasoning, or when they are requested to work through the checklist of reason categories. Sharing does not differ for headlines that were initially perceived as false, which could be because sharing of these headlines is rare to begin with.

Figure 6 shows that, similar to the results we observed for requesting reasoning, presenting participants with reason checkboxes resulted in less sharing of content that participants initially perceived as true and did not lower sharing of headlines that were perceived as false. Sharing of content perceived as false, however, was already rare (5% in condition 3, 4% in condition 4). This is why the interaction between reasoning format and perceived accuracy was significant in the perceived accuracy model [χ²(…) = …, p = .…; χ²(…) = …, p = .…].

We examined participants' free-text responses to understand why they were willing to share a fraction, albeit small, of the headlines they perceived as false. One member of the research team used open coding to assign labels to participant responses (a total of 427). Of the responses that provided a reason, the most cited was that the participant found the story entertaining or thought it amusing to see which of their social media friends would believe it: "If I felt in a playful mood I might post this just to see how many people no longer recognize satire" (22%). Others stated they would consider sharing the claim after fact-checking it: "If I could verify the contents this would be worth sharing." (20%). Some considered the claim a debate starter: "I would share this because I think it would spark a good debate between pro and anti gun members. It would be interesting to see if people actually believe this information in the headline is true or not." (17%). Other reasons included wanting to let their social circle know that the claim is false: "I would share this only to point out the misinformation available on most anything." (11%), or wishing to have the claim fact-checked: "to see if anyone can prove the authenticity of the image" (10%). Some participants believed it was important to inform their friends about the claim in case it turned out to be true: "Just in case it is accurate and, in either case, would make people look into the claim." (9%). Some pointed out that they would share the article simply because it was interesting: "It is interesting, even though I am not sure if it is true or not." (4%). Others wished to share their emotions or frustration about the article with their social circle: "Just to point out how absurd the title of this article is." (4%).
Interestingly, in a few instances, we saw that although a participant had originally labeled a claim as inaccurate, they knowingly decided to share it to help advance their view: "Because I am against its [Marijuana's] legalization, maybe this would help instill fear in its users." (1%), aligned with what was suggested in [40]. On a few other occasions, we observed that participants had had a change of heart about the claim's accuracy: "I'd share because I know that it's not something that would just be out in the open like that and I know that he's stolen from the government by not paying his taxes, so I feel like it's accurate in a sense." (1%), or that they wanted to provide an addendum to the claim to help rectify it: "So I could type. '...Inadvertently?' (cough) Because Facebook is evil, not Cenobite evil, but corporation-evil. They did that on purpose, they know it, and I know it. The trouble is how few OTHER people know it." (Claim: Some Phone Cameras Inadvertently Opened While Users Scrolled Facebook App.)
Fig. 7. Share rate of true and false headlines across different confidence levels. For true headlines that they correctly assess as true, participants are less likely to share content they are less confident about, regardless of whether they are asked for their reasoning. It is on those headlines about which they are more confident that requesting reasoning plays a role. For false headlines that they mistakenly assess as true, asking about reasoning plays a role in sharing across all confidence levels.
Table 5. Examples of easy, medium, and hard calls around headline accuracy, assessed by the perceived accuracy of the headline averaged over all participants who assessed it. Average perceived accuracy is on a scale of 0-1, with 0 indicating that the headline was perceived as inaccurate and 1 as accurate.

Difficulty | Headline | Veracity | Avg. perceived accuracy
Easy | A Study Showed That Dogs Exhibit Jealousy | True | 0.90
Easy | Sipping Water Every 15 Minutes Will Prevent a Coronavirus Infection | False | 0.05
Medium | President Trump's Awarding of a Purple Heart to a Wounded Vet Went Unreported by News Media | True | 0.49
Medium | Eric Trump Tweeted About Iran Strike Before It Was Made Public | False | 0.45
Hard | There Are More Trees on Earth Than Stars in the Milky Way | True | 0.25
Hard | Rain That Falls in Smoky Areas After a Wildfire Is Likely to Be Extremely Toxic | False | 0.62
We observed that asking people to provide reasoning inhibits sharing of true as well as false content. We wished to see how sharing behavior across our treatments differs with regard to how confident participants in each condition are in their accuracy assessments. For instance, it is possible that asking people for their rationales makes them reluctant to share headlines about whose accuracy they report lower levels of confidence. Figure 7 shows that participants are unlikely to share headlines that they assess as false, regardless of whether they are confident in their assessment. For headlines that they assess as true, however, in all the treatment conditions, they are less likely to share headlines they are less confident about.

We expected that the participants who were asked to provide accuracy assessments or reasons would be less willing to share headlines about which they initially were less confident, compared to those participants who were not. This is what we observe for false headlines that participants initially misjudged as true. Surprisingly, however, for true headlines that participants had correctly judged as true, the intervention does not reduce sharing at lower confidence levels. It is on those headlines about which participants report higher confidence that the reasoning treatment and the reason checkboxes play a role.

Note that we asked participants to indicate their level of confidence before they provided their reasons, and they could not return to the confidence question once they advanced to the question requesting reasoning. It is possible that reflecting on their rationale lowered their confidence.
It is conceivable that there are headlines that are in fact accurate but sound too outlandish to be true. Conversely, actually false headlines can seem reasonable. While we found that headline veracity does indeed have a strong effect on whether people are willing to share a headline, we wanted to tease apart the ground truth from how accurate a claim was perceived to be according to the wisdom of the crowd. Therefore, we assigned to each headline an average perceived accuracy metric: the average of the accuracy assessments from all the participants who had provided one for the headline, mapping accurate to 1 and inaccurate to 0. With this metric, a headline no longer has a dichotomous ground truth; instead, it has a degree of truth.

Overall, actually true headlines had a higher average perceived accuracy than the false ones. Table 5 presents examples of headlines that were judged by most participants correctly or incorrectly, and some in between.

We built a linear model to explain the share likelihood of a headline as predicted by its average perceived accuracy. The dependent variable of the model was the average of share intentions for a headline from all the participants who had been presented the headline, mapping the 3-item Likert outcomes "Yes", "Maybe", and "No" to numeric values as explained in 6.2. The independent variable was the headline's average perceived accuracy (continuous). We fit this model to the data from each of the control and treatment conditions separately. Because the average perceived accuracy of each headline was calculated over the whole dataset, it remained constant across conditions. The share average, however, varied by condition.

Figure 8 shows how participants' sharing intentions differ across conditions as the average perceived accuracy of a headline increases. In the treatment groups, where we asked for an accuracy assessment or for reasoning in addition to accuracy, the slopes are higher than in the control condition. This observation suggests that the interventions helped people be more discriminating in sharing headlines that they perceived as true vs. false. The confidence intervals around the mean also appear narrower for the treatment conditions than for the control. With the numbers of participants across treatment conditions being similar to or less than that of the control, a narrower confidence interval around the treatment slopes suggests less dispersion and uncertainty in sharing intentions.
Fig. 8. Share likelihood of each headline by its perceived accuracy averaged over all treatment conditions. The slopes in the treatment groups are higher than in the control, suggesting more sharing differentiation between headlines that were perceived as true vs. false. The fitted lines in the treatment groups also have narrower confidence intervals, suggesting less dispersion and uncertainty in sharing intentions.
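The per-headline regression behind Figure 8 can be sketched in R as follows; the column names ("condition", "claim", "perceived_accuracy", "share_num") are assumed for illustration, and the snippet is not the authors' actual code.

    library(dplyr)

    # Average perceived accuracy per headline, computed over the whole dataset
    # so that it stays constant across conditions
    acc_by_claim <- d %>%
      group_by(claim) %>%
      summarise(avg_perceived_acc = mean(perceived_accuracy, na.rm = TRUE))

    # Average share intention per headline within each condition
    share_by_claim <- d %>%
      group_by(condition, claim) %>%
      summarise(share_avg = mean(share_num), .groups = "drop") %>%
      left_join(acc_by_claim, by = "claim")

    # One simple linear model per condition; a steeper slope indicates more
    # sharing differentiation between perceived-true and perceived-false headlines
    models <- lapply(split(share_by_claim, share_by_claim$condition),
                     function(x) lm(share_avg ~ avg_perceived_acc, data = x))
    lapply(models, coef)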
We conducted an exploratory analysis of the demographics of participants and their effect on share likelihood. We report these analyses in Appendix B.
This study investigated the effects of two types of interventions on people's intention to share headlines: asking people to assess the accuracy of a headline, and asking them to provide their rationale for why they believe the headline is or is not accurate. We observed that participants who assessed the accuracy of a headline were less likely to indicate a willingness to share it, compared to participants whom we did not prompt about accuracy. Although the intervention curbed the sharing of both true and false content, the reduction in sharing of false headlines was greater. Our results corroborate prior findings that nudging people to be mindful of news accuracy increases sharing discernment [53], and enrich our understanding of how impactful different nudges can be.

We found that asking people to reflect and elaborate on the reason for a headline's (in)accuracy further lowers the probability that they share either true or false content, compared to assessing the headline's accuracy alone. This intervention, however, reduces sharing of false content to a greater degree.

We observed that the decrease in sharing as a result of requesting structured reasoning was roughly similar across true and false content (17% for false, 18% for true in our sample). Although this observation suggests that selecting from a checklist of rationales, in addition to providing accuracy assessments and free-text reasoning, does not seem to help further differentiate between false and true content, such a checklist can still be used on platforms for the added benefit of capturing and surfacing reasons in structured form.

In addition, we examined how sharing intentions differ across conditions as the average perceived accuracy of headlines increases. We observed that the interventions caused people to be more discriminating in sharing content that they perceived as true vs. false compared to the control condition, and that the dispersion of sharing decisions in the treatment groups was also lower. Interestingly, Figure 8 shows that the slopes of sharing by perceived accuracy across all conditions are less than 1, indicating that as the perceived accuracy of a headline increases, its share likelihood also increases, but at a lower rate. One possible explanation for this phenomenon is that headlines that most people agree are true may appear less interesting and already known to others in one's social circle.

One concern around the generalizability of our results to online platforms is the possible existence of a Hawthorne effect, under which participants change their behavior due to an awareness of being studied [34, 41, 58]. We therefore need to understand whether participants would exhibit the same behavior observed in our study if they were placed in a different intervention condition. In a user study with users recruited from worker platforms, Pennycook et al. investigated how self-reported share likelihood of headlines was influenced by an accuracy nudge at the onset of the study, where, as a pre-task, they asked users to assess the accuracy of a single news item. They found that the participants' likelihood of sharing false headlines relative to true ones decreased with the intervention, similar to what we observed in our study. However, they reported that asking users to instead assess the humorousness of a headline did not yield similar results. In another study, Pennycook et al. sent an unsolicited message to Twitter users who had recently shared links to websites that produce false or hyperpartisan content and asked them to assess the accuracy of a non-political headline. They report that the quality of the news sites that these users shared in the 24 hours after receiving the message increased compared to the sites shared by others who had not received the message [52], suggesting that the effects of these behavioral nudges will indeed generalize to social media platforms.
In addition, other prior work has reported a correlation between the hypothetical sharing of news stories reported by survey respondents and actual sharing on social media [46]. With our nudges proving effective and the results likely to generalize to actual social media platforms, platform designers can implement these nudges to encourage more informed propagation and consumption of news.

As we discuss in Appendix D, the difference in spam rates across conditions may have influenced the makeup of the data. However, if these interventions are implemented on social media, it is unlikely that we would observe spamming behaviors similar to those on a paid worker platform. Spamming in lieu of providing legitimate rationales could itself be a clear signal of the sharer's credibility to those who encounter the shared news.

Clearly, factors beyond the perceived accuracy of the headline can affect sharing intentions. Some participant responses indicated that they were reluctant to share a headline because they were not interested in the topic: "It's not a story that I would share with others, I'm not interested and my followers wouldn't be either."; because the headline referred to a sensitive topic and might create controversy: "It is a sensitive topic that I feel strongly one way about and I don't want to start disagreements on my page."; because they simply do not share on social media: "I don't share anything on social media."; because they did not want to overburden their social media friends with information: "I would feel bad clogging up peoples timelines with useless information that is already well known."; because, even though the claim was true, it shed a good light on someone of the opposite party: "It is positive about Trump. I would never post anything positive about him."; or because, even though the claim was true, it put someone affiliated with their party in a bad light: "I wouldn't bad mouth Joe Biden", aligned with what is reported in prior work [65]. Nevertheless, the primary cited reason for not sharing a headline was that it was of dubious credibility: "That is just an outright lie. I refuse to contribute to the misinformation being shared on Facebook."

Sharing intentions, or lack thereof, can therefore be used as a proxy to gain insight into how accurate participants believed each headline to be after the interventions. Although we did request accuracy assessments in the two treatment groups where participants provided reasoning, these assessments were made at the onset of each item, before participants were subjected to the reasoning interventions. In addition, the control group did not provide accuracy assessments. It is thus by contrasting sharing intentions, not accuracy assessments, that we understand the effects of our interventions. Figure 8 confirms this paradigm: share probability increases with the perceived accuracy of a claim.
We have discussed the two studies in their respective sections; here we offer more general observations.

The behavioral nudges that we tested in this study, providing an accuracy assessment and a rationale for whether a news story is or is not accurate, proved effective at reducing the ratio of false to true content that participants indicated they were willing to share. The reduction in shared false headlines came at the cost of also curbing the sharing of true ones. However, we still believe that this outcome is preferable to leaving misinformation unchecked and unchallenged. Indeed, it is conceivable that platforms could benefit from overall improved engagement if the feeds they present to their users become more reliable. In addition, even if approaches similar to our nudges result in some loss of profit, the implications that unmuddied online information spaces have for society may warrant persuading platforms, through activism or legislation, to adopt them.
While platform moderators and fact-checkers play a valuable role in flagging and removing content that has already spread and become visible, other measures are needed to restrain the sharing of misinformation as it is handed from user to user. In addition, although policy-driven platform moderation is necessary in some contexts, communities should be wary of relinquishing all the power of content filtering and highlighting to platforms whose incentives, as for-profit entities running on ads, do not necessarily align with users' [26, 32]. The challenge of moderation is exacerbated because not all accounts of problematic behavior or posts can be provisioned a priori in platform policies, leading moderators to make ad-hoc decisions in grey areas that sometimes draw criticism [2, 30, 66, 68].

These challenges suggest that the problem of misinformation could additionally be tackled at the user level. The interventions that we studied in this work aim at this problem by, first, introducing a barrier, albeit a low one, to posting and sharing, and second, shifting users' attention to accuracy and away from the social feedback and engagement that they would receive at posting time. Our interventions can be used on existing social media platforms such as Twitter and Facebook and are aligned with the initiatives they have already undertaken to combat the proliferation of misinformation. However, the usefulness of these nudges is not limited to these platforms, as they can be leveraged on alternative platforms with different publishing models, such as WikiTribune, where users curate content collectively [49].

We envision that requesting reasoning on social media or content sharing platforms can be done in a fashion similar to how emotions and reactions are currently captured on existing social media, or how users cite references when developing content on wiki-based platforms. Structured reasons provided by users for or against a post's accuracy can serve as rich metadata with which other users can filter the posts they want to view. In such a scenario, a user might choose to view only those articles that have been evaluated as true by a friend, because the evaluator has asserted they have domain knowledge on the subject. The taxonomy that we developed originates from people untrained in the credibility signifiers often developed by experts, and we determined in our studies that other untrained people can reliably use it to provide their rationales. Therefore, requesting rationales via a checklist similar to that of our study does not appear to impose a barrier to entry for users, except perhaps in forcing some degree of deliberation, which is desirable. Prior work reports that the effectiveness of fact-checking depends on the relationship between the user offering the fact-checking information and the user requesting it or the one who has produced an inaccurate post [29, 38]. Therefore, by incorporating the accuracy assessments of our study into platforms and making them visible and accessible to users, we hope social media friends can benefit from them.
One direction for future work is to examine how users react to posts accompanied by rationale tags such as those in our study, and what factors they consider when deciding the credibility of a post or the persuasiveness of its tagged rationales. However, because the identity of a post's sharer can also impact perceived content credibility [20], such a study would be more informative if conducted as a field study on a social network rather than as a controlled experiment.

Our work examined the effects of accuracy assessment and reasoning nudges on content sharing when users are required to provide them. Future work can investigate the effects of allowing users to provide these signals optionally, similar to how "likes" and "upvotes" are captured on social media.
Although one reason people would share a story was to inform others, we found that there are various reasons they might share even a headline they do not necessarily believe to be accurate. Such reasons include that the article is entertaining, or to ask their social circle for help in fact-checking the story. Future work could investigate how providing a checklist of these sharing intentions, in a fashion similar to our study, would affect sharing behaviors on social media, and how it would impact the consumption of posts on which these signals are provided by the users who encounter them.

One interesting observation from our exploratory analyses in Appendix B was that our Republican participants were more likely to share false claims than our Democratic participants. While these analyses were not planned a priori, they are bolstered by prior studies that have reported similar findings [27]. These observations give rise to interesting questions that could be examined in future work, such as whether behavioral nudges should be applied indiscriminately to all platform users or only to those who have been found to habitually share misinformation, and whether users should be primed every time they intend to share posts or whether it is sufficient to apply the nudges only occasionally.
In condition 4 of the Nudge study, where we presented the checklist of reason categories, we also asked participants to explain their choices in free-text, to examine how they had used the categories and whether they had truly understood their intended use. With this design, comparing this condition with condition 3, where participants provided their reasoning via free-text alone, gives us insight into whether restricting people's rationales to the taxonomy framework works as well as allowing them to provide unstructured reasons. However, our study did not include a condition where participants worked through the checklist without elaborating on their choices. The inclusion of such a condition would have allowed us to determine whether the checklist of reasons can replace free-text entirely and induce the same level of discernment. Not requiring free-text, but rather offering it as an optional addition to the checklist, could be more desirable for the adoption of this strategy on social media. Future work can include a condition of this nature.
10 Conclusion
In this work, we explored how to alter social media platforms so that users have content accuracy in mind when sharing posts. We explored nudges that could be used at scale, intended to encourage users to consider whether a post is accurate and their rationales for believing so. To facilitate capturing people's rationales in a structured format and help the adoption of the nudges on social media, we conducted a study in which we developed a taxonomy of reasons why people believe or disbelieve news claims. That study involved presenting news claims to people along with the taxonomy and asking them to use the reason categories to provide their rationales for the (in)accuracy of the claims. We conducted multiple iterations of the study, revising the taxonomy until participants could reliably use it to label their responses.

We then examined the effects of two different nudges, assessing accuracy and providing reasoning for why a news story is or is not accurate, on people's sharing intentions on social media. We found that both nudges reduce sharing of true and false content, but that the decrease in sharing of false content was greater. Our findings on the effects of the accuracy and reasoning nudges offer implications for social media platform designers on how to mitigate the sharing of false information. Furthermore, these platforms can ask their users to provide accuracy assessments for the posts they share by guiding them through the taxonomy categories that we developed in the study. These structured reasons could potentially help those who encounter the post, e.g., by enabling
them to filter their newsfeed based on the different reasons that post sharers have specified, arguing for or against a post's accuracy.
11 Acknowledgments
We would like to thank Ezra Karger, Ali Kheradmand, and Mohammad Amin Nabian for their valuable feedback regarding the statistical analyses.
References

[1] [n.d.]. Combatting Vaccine Misinformation - About Facebook. https://about.fb.com/news/2019/03/combatting-vaccine-misinformation/
[2] [n.d.]. Facebook apologises for blocking Prager University's videos.
[3] [n.d.]. Facebook is ditching its own solution to fake news because it didn't work. https://qz.com/1162973/to-fight-fake-news-facebook-is-replacing-flagging-posts-as-disputed-with-related-articles/
[4] [n.d.]. QAnon and the storm of the U.S. Capitol: The offline effect of online conspiracy theories. https://theconversation.com/qanon-and-the-storm-of-the-u-s-capitol-the-offline-effect-of-online-conspiracy-theories-152815
[7] Natalya N Bazarova, Yoon Hyung Choi, Victoria Schwanda Sosik, Dan Cosley, and Janis Whitlock. 2015. Social sharing of emotions on Facebook: Channel differences, satisfaction, and replies. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. 154–164.
[8] Shashank Bengali. 2019. How WhatsApp is battling misinformation in India, where 'fake news is part of our culture'. (2019).
[9] Md Momen Bhuiyan, Amy X Zhang, Connie Moon Sehat, and Tanushree Mitra. 2020. Investigating Differences in Crowdsourced News Credibility Assessment: Raters, Tasks, and Expert Criteria. Proceedings of the ACM on Human-Computer Interaction 4, CSCW2 (2020), 1–26.
[10] Leticia Bode and Emily K Vraga. 2018. See something, say something: Correction of global health misinformation on social media. Health Communication 33, 9 (2018), 1131–1140.
[11] Alexandre Bovet and Hernán A Makse. 2019. Influence of fake news in Twitter during the 2016 US presidential election. Nature Communications 10, 1 (2019), 1–14.
[12] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web. 675–684.
[13] Kathy Charmaz and Linda Liska Belgrave. 2007. Grounded theory. The Blackwell Encyclopedia of Sociology (2007).
[14] Edward T Cokely, Mirta Galesic, Eric Schulz, Saima Ghazal, and Rocio Garcia-Retamero. 2012. Measuring risk literacy: The Berlin numeracy test. Judgment and Decision Making (2012).
[15] Alistair Coleman. [n.d.]. 'Hundreds dead' because of Covid-19 misinformation.
[16] … Proceedings of the National Academy of Sciences …
[17] … Harvard Kennedy School Misinformation Review 1, 1 (2020).
[18] Pranav Dixit and Ryan Mac. 2018. How WhatsApp Destroyed A Village. Buzzfeed News (2018).
[19] Ziv Epstein, Gordon Pennycook, and David Rand. 2020. Will the crowd game the algorithm? Using layperson judgments to combat misinformation on social media by downranking distrusted sources. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–11.
[20] Martin Flintham, Christian Karner, Khaled Bachour, Helen Creswick, Neha Gupta, and Stuart Moran. 2018. Falling for fake news: investigating the consumption of news via social media. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–10.
[21] Shane Frederick. 2005. Cognitive reflection and decision making. Journal of Economic Perspectives 19, 4 (2005), 25–42.
[22] Christine Geeng, Savanna Yee, and Franziska Roesner. 2020. Fake News on Facebook and Twitter: Investigating How People (Don't) Investigate. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–14.
[23] Lucas Graves. 2016. Deciding what's true: The rise of political fact-checking in American journalism. Columbia University Press.
[24] Nir Grinberg, Kenneth Joseph, Lisa Friedland, Briony Swire-Thompson, and David Lazer. 2019. Fake news on Twitter during the 2016 US presidential election. Science …
[25] … In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. 726–739.
[26] Jennifer Grygiel and Nina Brown. 2019. Are social media companies motivated to be good corporate citizens? Examination of the connection between corporate social responsibility and social media safety. Telecommunications Policy 43, 5 (2019), 445–460.
[27] Andrew Guess, Jonathan Nagler, and Joshua Tucker. 2019. Less than you think: Prevalence and predictors of fake news dissemination on Facebook. Science Advances 5, 1 (2019), eaau4586.
[28] Maria Haigh, Thomas Haigh, and Tetiana Matychak. 2019. Information Literacy vs. Fake News: The Case of Ukraine. Open Information Science 3, 1 (2019), 154–165.
[29] Aniko Hannak, Drew Margolin, Brian Keegan, and Ingmar Weber. 2014. Get Back! You Don't Know Me Like That: The Social Mediation of Fact Checking Interventions in Twitter Conversations. In ICWSM.
[30] Yasmin Ibrahim. 2017. Facebook and the Napalm Girl: reframing the iconic as pornographic. Social Media + Society 3, 4 (2017), 2056305117743140.
[31] Alireza Karduni, Isaac Cho, Ryan Wesslen, Sashank Santhanam, Svitlana Volkova, Dustin L Arendt, Samira Shaikh, and Wenwen Dou. 2019. Vulnerable to misinformation? Verifi!. In Proceedings of the 24th International Conference on Intelligent User Interfaces. 312–323.
[32] Makena Kelly. [n.d.]. Facebook proves Elizabeth Warren's point by deleting her ads about breaking up Facebook.
[33] … In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 324–332.
[34] Frauke Kreuter, Stanley Presser, and Roger Tourangeau. 2008. Social desirability bias in CATI, IVR, and web surveys: the effects of mode and question sensitivity. Public Opinion Quarterly 72, 5 (2008), 847–865.
[35] Travis Kriplean, Caitlin Bonnar, Alan Borning, Bo Kinney, and Brian Gill. 2014. Integrating on-demand fact-checking with public dialogue. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. 1188–1199.
[36] J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics (1977), 159–174.
[37] Farhad Manjoo. 2013. You won't finish this article. Why people online don't read to the end. Slate (2013).
[38] Drew B Margolin, Aniko Hannak, and Ingmar Weber. 2018. Political fact-checking on Twitter: When do corrections have an effect? Political Communication 35, 2 (2018), 196–219.
[39] Cameron Martel, Mohsen Mosleh, and David Gertler Rand. 2021. You're definitely wrong, maybe: Correction style has minimal effect on corrections of misinformation online. Media and Communication 9, 1 (2021).
[40] Alice E Marwick. 2018. Why do people share fake news? A sociotechnical model of media effects. Georgetown Law Technology Review 2, 2 (2018), 474–512.
[41] Jim McCambridge, John Witton, and Diana R Elbourne. 2014. Systematic review of the Hawthorne effect: new concepts are needed to study research participation effects. Journal of Clinical Epidemiology 67, 3 (2014), 267–277.
[42] Meredith Ringel Morris, Scott Counts, Asta Roseway, Aaron Hoff, and Julia Schwarz. 2012. Tweeting is believing? Understanding microblog credibility perceptions. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work. 441–450.
[43] Mohsen Mosleh, Cameron Martel, Dean Eckles, and David G. Rand. 2021. Perverse Downstream Consequences of Debunking: Being Corrected by Another User for Posting False Political News Increases Subsequent Sharing of Low Quality, Partisan, and Toxic Content in a Twitter Field Experiment. In To appear in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems.
[44] Mohsen Mosleh, Cameron Martel, Dean Eckles, and David G Rand. 2021. Shared partisanship dramatically increases social tie formation in a Twitter field experiment. Proceedings of the National Academy of Sciences …
[45] … Nature Communications 12, 1 (2021), 1–10.
[46] Mohsen Mosleh, Gordon Pennycook, and David G Rand. 2020. Self-reported willingness to share political news articles in online surveys correlates with actual sharing on Twitter. PLoS One 15, 2 (2020), e0228882.
[47] Samuel Murray, Matthew Stanley, Jonathon McPhetres, Gordon Pennycook, and Paul Seli. 2020. "I've said it before and I will say it again": Repeating statements made by Donald Trump increases perceived truthfulness for individuals across the political spectrum. (2020).
[48] Onook Oh, Kyounghee Hazel Kwon, and H Raghav Rao. 2010. An Exploration of Social Media in Extreme Events: Rumor Theory and Twitter during the Haiti Earthquake 2010. In ICIS, Vol. 231. 7332–7336.
[49] Sheila O'Riordan, Gaye Kiely, Bill Emerson, and Joseph Feller. 2019. Do you have a source for that? Understanding the Challenges of Collaborative Evidence-based Journalism. In Proceedings of the 15th International Symposium on Open Collaboration. 1–10.
[50] Gordon Pennycook, Adam Bear, Evan T Collins, and David G Rand. 2020. The implied truth effect: Attaching warnings to a subset of fake news headlines increases perceived accuracy of headlines without warnings. Management Science (2020).
[51] Gordon Pennycook, Tyrone D Cannon, and David G Rand. 2018. Prior exposure increases perceived accuracy of fake news. Journal of Experimental Psychology: General …
[52] … Nature (2021).
[53] Gordon Pennycook, Jonathon McPhetres, Yunhao Zhang, Jackson G Lu, and David G Rand. 2020. Fighting COVID-19 misinformation on social media: Experimental evidence for a scalable accuracy-nudge intervention. Psychological Science 31, 7 (2020), 770–780.
[54] Gordon Pennycook and David G Rand. 2019. Fighting misinformation on social media using crowdsourced judgments of news source quality. Proceedings of the National Academy of Sciences …
[55] … Cognition 188 (2019), 39–50.
[56] Julie Posetti and Alice Matthews. 2018. A short guide to the history of 'fake news' and disinformation. International Center For Journalists (2018), 2018–07.
[57] Martin Potthast, Sebastian Köpsel, Benno Stein, and Matthias Hagen. 2016. Clickbait detection. In European Conference on Information Retrieval. Springer, 810–817.
[58] Chris Preist, Elaine Massung, and David Coyle. 2014. Competing or aiming to be average? Normification as a means of engaging digital volunteers. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. 1222–1233.
[59] Walter Quattrociocchi, Antonio Scala, and Cass R Sunstein. 2016. Echo chambers on Facebook. Available at SSRN 2795110 (2016).
[60] John Reed. 2018. Hate speech, atrocities and fake news: The crisis of democracy in Myanmar. (2018).
[61] Ana Lucía Schmidt, Fabiana Zollo, Antonio Scala, Cornelia Betsch, and Walter Quattrociocchi. 2018. Polarization of the vaccination debate on Facebook. Vaccine 36, 25 (2018), 3606–3612.
[62] Scott Shane. 2017. The fake Americans Russia created to influence the election. The New York Times 7, 09 (2017).
[63] Chengcheng Shao, Giovanni Luca Ciampaglia, Onur Varol, Kai-Cheng Yang, Alessandro Flammini, and Filippo Menczer. 2018. The spread of low-credibility content by social bots. Nature Communications 9, 1 (2018), 1–9.
[64] Jieun Shin, Lian Jian, Kevin Driscoll, and François Bar. 2017. Political rumoring on Twitter during the 2012 US presidential election: Rumor diffusion and correction. New Media & Society 19, 8 (2017), 1214–1235.
[65] Jieun Shin and Kjerstin Thorson. 2017. Partisan selective sharing: The biased diffusion of fact-checking messages on social media. Journal of Communication 67, 2 (2017), 233–255.
[66] Robert Shrimsley. [n.d.]. Facebook photos: snap judgments.
[67] … ACM SIGKDD Explorations Newsletter 19, 1 (2017), 22–36.
[68] Sara Spray. [n.d.]. Facebook Is Embroiled In A Row With Activists Over "Censorship".
[69] … In ICWSM. 365–374.
[70] Kate Starbird, Dharma Dailey, Owla Mohamed, Gina Lee, and Emma S Spiro. 2018. Engage early, correct more: How journalists participate in false rumors online during crisis events. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–12.
[71] Kate Starbird, Jim Maddock, Mania Orand, Peg Achterman, and Robert M Mason. 2014. Rumors, false flags, and digital vigilantes: Misinformation on twitter after the 2013 Boston Marathon bombing. iConference 2014 Proceedings (2014).
[72] Paul Steinhauser. [n.d.]. Arizona certifies Biden as election winner, with Wisconsin following hours later.
[73] … Qualitative Analysis for Social Scientists. Cambridge University Press.
[74] S Shyam Sundar. 1998. Effect of source attribution on perception of online news stories. Journalism & Mass Communication Quarterly 75, 1 (1998), 55–68.
[75] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science …
[76] … Science Communication 39, 5 (2017), 621–645.
[77] Emily K Vraga and Leticia Bode. 2020. Defining misinformation and understanding its bounded nature: using expertise and evidence for describing misinformation. Political Communication 37, 1 (2020), 136–144.
[78] Claire Wardle and Hossein Derakhshan. 2017. Information disorder: Toward an interdisciplinary framework for research and policy making. Council of Europe Report 27 (2017).
[79] Sam Wineburg and Sarah McGrew. 2017. Lateral reading: Reading less and learning more when evaluating digital information. (2017).
[80] Liang Wu, Jundong Li, Xia Hu, and Huan Liu. 2017. Gleaning wisdom from the past: Early detection of emerging rumors in social media. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 99–107.
[81] Waheeb Yaqub, Otari Kakhidze, Morgan L Brockman, Nasir Memon, and Sameer Patil. 2020. Effects of Credibility Indicators on Social Media News Sharing Intent. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–14.
[82] Amy X Zhang, Aditya Ranganathan, Sarah Emlen Metz, Scott Appling, Connie Moon Sehat, Norman Gilmore, Nick B Adams, Emmanuel Vincent, Jennifer Lee, Martin Robbins, et al. 2018. A structured response to misinformation: Defining and annotating credibility indicators in news articles. In Companion Proceedings of The Web Conference 2018. 603–612.
[83] Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS One 11, 3 (2016), e0150989.
A Association Between Participants' Demographics and the Taxonomy Categories
We performed an exploratory analysis on the data from our Taxonomy study to understand whether the demographics of participants influence the types of rationales they give for why they believe a claim is or is not accurate. This analysis was not part of our study design and was added as a stepping stone for future work. We limited our analyses to the data obtained from the last iteration of the study because the categories that the participants and the research team coder had used in the prior iterations had changed. We further excluded those datapoints for which we did not have the gender, party, and ethnicity of the participant, resulting in 953 datapoints, of which 645 had free-text elaboration (i.e., did not belong to the Don't know categories).

For party, each participant was labeled as either a Democrat or a Republican. We were able to place all participants, including those who identified as Independent or other (e.g., Green), in one of the two categories because, in addition to party, we had asked participants about their political preference (strongly Republican, lean Republican, Republican, Democrat, lean Democrat, strongly Democrat). Because the majority of our participants were White, ethnicity was given the values White and Not White. With respect to education, we categorized participants as having a college degree (including an Associate's degree) or not.

We then performed Chi-square tests of independence on the contingency tables of rationales and each of the demographic factors of party, gender, and ethnicity. We caution that these tests were underpowered considering the number of categories and our sample size, and future studies are needed to ascertain whether our results hold with a larger sample. The tests did not find a statistically significant association between the rationales that participants gave and their party or ethnicity [χ²(…) = …, p = .12 for party; χ²(…) = …, p = .47 for ethnicity]. However, the rationales were not independent of the participants' gender [χ²(…) = …, p = .…].
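Such a test can be run in R as sketched below; the data frame and column names are hypothetical, and the category labels are illustrative placeholders rather than the taxonomy's actual wording.

    # Toy stand-in for the Taxonomy study data (illustrative only)
    taxonomy_data <- data.frame(
      rationale_category = sample(c("Consistency", "Source", "Bias"), 100, replace = TRUE),
      gender = sample(c("Female", "Male"), 100, replace = TRUE)
    )

    # Contingency table of rationale category by gender, then a chi-square
    # test of independence; the same recipe applies to party and ethnicity
    tab <- table(taxonomy_data$rationale_category, taxonomy_data$gender)
    chisq.test(tab)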
Fig. 9. Distributions of rationales by the gender of the user for claims perceived as accurate and inaccurate. Each bar shows the ratio of the rationale category relative to all rationales given by users of the same gender.

Fig. 10. Distributions of rationales by the gender of the user for other signals of credibility besides accuracy. Each bar shows the ratio of the rationale category relative to all rationales given by users of the same gender.
B Effects of the Demographics of the Nudge Study Participants
We performed an exploratory analysis of the demographics of the Nudge study participants tounderstand what role these factors play in participants’ decision of whether to share headlines. Wehad not planned for these analyses a priori in our experiment design, but later included them forcompleteness. Thus, this analysis should be considered exploratory, and 𝑝 -values not indicativeof true significance. We developed a linear model with share intention as the dependent variable, Proc. ACM Hum.-Comput. Interact., Vol. 5, No. CSCW1, Article 18. Publication date: April 2021.
Fig. 10. Distributions of rationales by the gender of the user for other signals of credibility besides accuracy.Each bar shows the ratio of the rationale category relative to all rationales asked by users of the same gender. with outcomes “Yes”, “Maybe”, and “No” mapped to numeric values as explained in 6.2, in additionto several demographics factors:share ∼ veracity × ( partisanship concordance × ( accuracy condition + reasoning condition + reasoning format) + party + gender + age + ethnicity + education ) + ( | participant ) + ( | claim ) (3)Similar to the model in 6.2, we included the veracity of the headline and treatment conditions asindependent variables, and claim and participant as random effects. We limited this analysis tothat portion of the data for which we had the complete demographics information required forour model, excluding 638 datapoints. In addition, we excluded 46 datapoints from the participantswho had identified as neither male nor female because these datapoints were too few for fitting themodel.We treated party, ethnicity, and education similar to Appendix A. We binned age into 7 buckets.We did not include the interaction between the demographic factors because given our sample size,we did not have enough power to do so.In the model, we also included partisanship concordance which was a measure of the alignmentbetween the participants’ self-declared party and the partisanship rating that they had given tothe headline (measured on a 5-item Likert scale). This value ranged from 1-5 with 1 indicating noalignment, and 5 complete alignment.concordance = ( partisanship rating of the headline ) × ( rater party == Republican )+( − partisanship rating of the headline ) × ( rater party == Democratic ) (4)Because this model includes several demographic factors that may capture some degree ofvariance in share likelihood, the effects of the treatments observed in this more refined modelcan serve as a confirmation of the results in 6.2. The results obtained for the demographic factors Proc. ACM Hum.-Comput. Interact., Vol. 5, No. CSCW1, Article 18. Publication date: April 2021. xploring Lightweight Interventions at Posting Time to Reduce the Sharing of Misinformation on Social Media 18:33
The results obtained for the demographic factors, however, should be taken with caution and further examined in future work, as these were not planned analyses.

We performed a Wald Chi-Square test on the fitted model to determine which of the factors had a significant effect. Consistent with the results in 6.2, the effects of veracity, providing accuracy, providing reasoning, and reasoning format were significant [χ² = …, p < .001 for veracity; χ² = …, p < .001 for providing accuracy; χ² = …, p < .01 for providing reasoning; χ² = …, p = .03 for reasoning format]. The interaction between veracity and whether the participant was asked about accuracy was also significant at the α = .05 level [χ² = …, p = …].

Fig. 11. Predicted values (marginal effects) for share likelihood as concordance (alignment between headline and participant partisanship) increases, obtained from the model with demographics included as independent variables.

B.0.0.1 Concordance had a statistically significant effect on sharing intentions [χ² = …, p < …; χ² = …, p < …; χ² = …, p < …].

Fig. 12. The alignment of a headline's partisanship with the participant's increases the likelihood of sharing slightly more when the headline is true than when it is false.

Fig. 13. Asking users to assess the accuracy of headlines restrains their sharing of headlines that are well-aligned with their partisanship.
B.0.0.2 Party had a significant effect on sharing intentions [χ² = …, p < …]. The interaction between party and headline veracity also had a significant effect [χ² = …, p < …].

Fig. 14. Sample means of share likelihood by party. Republican participants were more likely to share headlines.

Fig. 15. Sample means of share likelihood by party and headline veracity. Democratic participants were less likely to share false headlines compared to Republicans.

B.0.0.3 Gender had a significant effect on sharing intentions [χ² = …, p < …].
Fig. 16. Sample means of share likelihood by gender. Male participants were slightly more likely to share headlines than female participants.

Fig. 17. Sample means of share likelihood by education. Participants who held a college degree were more likely to share headlines, although the difference in share likelihood is small.
B.0.0.4 Education had a significant effect on likelihood of sharing at the α = .05 level [χ² = …, p = …].

C Cumulative Link Models
We tested the effects of our interventions on share intentions using the same formula as outlined in Section 6.2, but using cumulative link mixed models instead of linear models. Cumulative link models are appropriate for fitting ordinal values and model the cumulative probability of the ith rating (datapoint) falling in the jth category or below. The categories in our data are the ordered share decisions "No", "Maybe", and "Yes". The cumulative link model assumes that there is a continuous but unobservable variable Y_i with a mean that depends on the predictors, and that this underlying distribution has a set of cut-points θ_1, θ_2, ..., θ_j such that if θ_k < Y_i < θ_{k+1}, the manifest response (share decision) takes the value k.

Similar to the linear models from 6.2, we developed a veracity model in which the independent variables were the main effects of the objective veracity of the headlines and our treatments (asking about accuracy, asking about reasoning, presenting checkboxes vs free-text to capture reasons), as well as the interactions between veracity and the treatments. In addition, we developed the cumulative link counterpart to the perceived accuracy model in 6.2, which was fit to the data from the treatment conditions. In this model, the independent variables were the participant's assessment of the accuracy of the headline, whether participants were asked to provide reasoning, and whether they were presented with checkboxes, as well as the interactions between these treatments and the accuracy assessment. In both models we included participant and claim as random effects.

To fit these models, we used the function "clmm" with a "logit" link from the package "ordinal" in R and set the threshold type to symmetric. We then performed Likelihood Ratio Chi-Square tests (function "Anova" from package "RVAideMemoire") on each of the fitted models to determine whether the effects of the independent variables were significant. If we determined that a factor was significant, we then performed a post-hoc Estimated Marginal Means (EMMeans) test across the levels of the factor of interest, averaging over all other factors. We used the function "emmeans" from the R package "emmeans" with mode "mean.class" to obtain and compare the expected values of the ordinal response on a scale of 1 to 3 (the number of categories) for each of the levels of the factor of interest. P-values were adjusted with the Tukey method to account for multiple comparisons. The results of the cumulative link models were consistent with the results obtained from the linear models in 6.2.
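As a sketch, this pipeline — fitting a cumulative link mixed model of the form P(Y_i ≤ j) = logit⁻¹(θ_j − η_i), then running Likelihood Ratio tests and EMMeans on the mean-class scale — might look as follows in R. The data frame d and its column names are hypothetical, not the authors' code:

```r
# Illustrative sketch (not the authors' code); column names are hypothetical.
library(ordinal)        # clmm
library(car)            # generic Anova()
library(RVAideMemoire)  # supplies the Anova method for clmm fits
library(emmeans)

# Ordered share decisions: "No" < "Maybe" < "Yes".
d$share <- factor(d$share, levels = c("No", "Maybe", "Yes"), ordered = TRUE)

# Veracity model: main effects of veracity and the treatments, plus
# veracity-by-treatment interactions; random intercepts for participant and claim.
m <- clmm(share ~ veracity * (accuracy_cond + reasoning_cond + reasoning_format) +
            (1 | participant) + (1 | claim),
          data = d, link = "logit", threshold = "symmetric")

# Likelihood Ratio Chi-Square tests for each factor.
Anova(m)

# Post-hoc EMMeans on the mean-class scale (expected response on 1-3),
# averaged over the other factors, with Tukey-adjusted comparisons.
em <- emmeans(m, ~ accuracy_cond, mode = "mean.class")
pairs(em, adjust = "tukey")
```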
Similar to the results we observed for the linear model counterparts, the effects of veracity in the veracity model and of perceived accuracy in the perceived accuracy model were both significant [χ² = …, p < .001 for the veracity model; χ² = …, p < .001 for the perceived accuracy model]. The EMMeans showed that participants were more likely to have the intention of sharing objectively true rather than false headlines [z = …, p < …; E(False) = …, E(True) = …], and headlines they perceived as true rather than false [z = …, p < …; E(Perceived as false) = …, E(Perceived as true) = …]. The effect of providing an accuracy assessment was also significant [χ² = …, p < …; z = …, p < …; E(Accuracy not provided) = …, E(Accuracy provided) = …], as were the effects of providing reasoning and of reasoning format [providing reasoning: χ² = …, p < … for veracity, χ² = …, p < .001 for perceived accuracy; reasoning format: χ² = …, p = .01 for veracity, χ² = …, p = .02 for perceived accuracy]. Participants were more likely to share headlines if they were not asked to provide their reasoning about why the claim was or was not accurate [z = …, p < …; E(Reasoning not provided) = …, E(Reasoning provided) = … for veracity; z = …, p < …; E(Reasoning not provided) = …, E(Reasoning provided) = … for perceived accuracy]. The reasoning format conditions also differed [z = …, p = …; E(Free-text) = …, E(Checkbox) = … for veracity; z = …, p = …; E(Free-text) = …, E(Checkbox) = … for perceived accuracy].

The veracity model also indicated that the interaction between veracity and providing accuracy is statistically significant [χ² = …, p < …; E(False, Accuracy not provided) = …, E(False, Accuracy provided) = …, E(True, Accuracy not provided) = …, E(True, Accuracy provided) = …].

D Investigation of Potential Confounds in the Nudge Study

D.1 Makeup of Data and Impact of Removing Spams.
The task in the Nudge study presented 10 claims to each participant, but because some participants abandoned the task before its conclusion, we had fewer data from them. It is conceivable that if the attrition rate differed across conditions, then the conditions differ not only in what treatment they received, but also in what type of people contributed more data to each condition. Therefore, we probed how many participants per condition did not finish all 10 headlines. This number was in the range of 30-50 across all conditions, suggesting that the dropout rate was similar. We then analyzed the spam rate across conditions, which was more variable (condition 1: 145, condition 2: 114, condition 3: 136, condition 4: 174). It is possible that different interventions result in different spam rates, and that those participants who stay and work through a more laborious condition are, in fact, characteristically different from those who finish the task by spamming.

We performed a Pearson's Chi-Square test to investigate whether the distribution of spams across conditions differed from a uniform distribution (see the sketch at the end of this subsection). The difference was statistically significant [χ² = …, p = …]. We then analyzed the share rate in spams across the different conditions, which was similar (see Table 6), with a share mean of approximately 0.72 across all conditions regardless of headline veracity.

In the main section of the paper, we presented our findings excluding the spams. However, we perform the same analyses including the spam datapoints in Appendix E. Some of our results that pertain to conditions with heavier interventions are no longer statistically significant when spams are included. The reason is that in these conditions the share rates are low, and therefore the difference between conditions is smaller but detectable in the absence of noise. Including noise, i.e., spams, in a condition increases the sample size while adding a relatively large number of positive datapoints (datapoints that indicate a positive intention of sharing) for both true and false headlines. The difference that existed before is thereby diluted.
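For reference, the uniformity test mentioned above reduces to a one-line goodness-of-fit test on the per-condition spam counts reported in the text:

```r
# Pearson's chi-square goodness-of-fit test: are spam counts uniform
# across the four conditions? Counts are those reported in the text.
spams <- c(145, 114, 136, 174)  # conditions 1-4
chisq.test(spams)               # default null: equal expected counts
```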
D.2 Deliberation Priming.
Although in the control condition for the Taxonomy study we did not apply any of the accuracy and reasoning nudges, after each sharing decision we asked participants why they would or would not share the article. This question by itself may have acted as a deliberation prime on subsequent sharing decisions. To test this hypothesis, we developed a model with share intention as the dependent variable, veracity and whether the item was the first item presented to the user as independent variables, and participant identifier as a random effect. We fit the model to the first and last datapoints that participants in the control condition provided. We found that, as expected, veracity was positively correlated with sharing intentions and the correlation was statistically significant [β = …, p < …; β = …, p = …; β = −…, p = .16 for the last decision, … = .08 for the first]. This observation gives some degree of support to the hypothesis that simply asking users to ponder their sharing decision may have primed them to be mindful of the headline's accuracy over time.
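A minimal sketch of this model in R, assuming hypothetical flags is_first and is_last marking each participant's first and last decisions (not the authors' code):

```r
# Illustrative sketch: deliberation-priming model, fit only to each control
# participant's first and last sharing decisions.
library(lmerTest)  # lmer with p-values via the Satterthwaite approximation

first_last <- subset(d_control, is_first | is_last)  # hypothetical flags
m <- lmer(share ~ veracity + is_first + (1 | participant), data = first_last)
summary(m)  # coefficients (betas) with p-values
```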
D.3 Investigating Potential Learning Effects of Repeated Accuracy Assessments
We wished to investigate whether making repeated judgements about accuracy had a learning effect on participants, leading their subsequent accuracy assessments to be closer to the headline's actual veracity. Therefore, we fit the following model to the first and last datapoints that participants in the treatment conditions had provided:

(accuracy assessment == veracity) ∼ is first question × (reasoning condition + reasoning format) + (1 | participant) + (1 | claim)    (5)

Because the outcome of the model was dichotomous (1 if veracity and perceived accuracy matched, 0 if they did not), we used the function "glmer" with link "logit" from the R package "lme4" to fit the data. A Wald Chi-Square test on the model revealed that the effect of whether the participant's judgement was the first or the last was in fact not significant [χ² = …, p = …].
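A sketch of Equation (5) in R (again with hypothetical column names, not the authors' code):

```r
# Illustrative sketch: mixed-effects logistic regression for Eq. (5).
library(lme4)
library(car)

# Dichotomous outcome: 1 if the accuracy assessment matched the headline's
# veracity, 0 otherwise.
d$correct <- as.integer(d$accuracy_assessment == d$veracity)

m <- glmer(correct ~ is_first_question * (reasoning_cond + reasoning_format) +
             (1 | participant) + (1 | claim),
           data = d, family = binomial(link = "logit"))

Anova(m)  # Wald chi-square tests, as described in the text
```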
E Spams

In this section, we include the spam datapoints of the Nudge study in the dataset and perform the same analyses that we conducted in the Results section.
Table 6. Share means in spam entries across experimental conditions and headline veracity.
Veracity   Cond. 1   Cond. 2   Cond. 3   Cond. 4
True       0.76      0.70      0.67      0.70
False      0.74      0.72      0.71      0.73

The Wald Chi-Square tests on our fitted linear models revealed that the effect of veracity in the veracity model and the effect of perceived accuracy in the perceived accuracy model on sharing intention were both significant [χ² = …, p < .001 for veracity, χ² = …, p < … for perceived accuracy; z = …, p < …; z = …, p < …].

E.1 Effect of Providing Accuracy Assessments
Consistent with the results we observed when excluding spams, providing an accuracy assessment had a significant effect on sharing intentions in the veracity model [χ² = …, p < …].

E.2 Effect of Providing Reasoning
The effect of reasoning on sharing intention when including spams was not significant in either the veracity or the perceived accuracy model [χ² = …, p = .53 for the veracity model; χ² = …, p = .48 for the perceived accuracy model]. Similarly, the interaction effect of reasoning and veracity was not significant in the veracity model [χ² = …, p = …], nor was the corresponding interaction in the perceived accuracy model [χ² = …, p = …].

E.3 Effect of Reasoning Format
We observed that the effect of reasoning format when including spams was not significant in either the veracity or the perceived accuracy model [χ² = …, p = .43 in the veracity model; χ² = …, p = .47 in the perceived accuracy model]. The effect of the interaction between veracity and reasoning format was not significant either [χ² = …, p = …; χ² = …, p < …].

F Headlines Used in the Study of Behavioral Nudges
We present the headlines that we used in the user study of behavioral nudges, along with their veracity and partisanship. The partisanship of a headline was rated by each participant who was presented the headline in the study. The partisanship measure in the table is an average over all these ratings, on a scale of -2 (more favorable for Democrats) to 2 (more favorable for Republicans).
Fig. 18. Share rate of true and false headlines across study conditions, including spam datapoints. The results suggest that people are less likely to share both accurate and inaccurate content if they are asked to assess the content's accuracy, although the reduction in shared false content is larger (condition 1 vs 2). However, asking people to provide their reasoning in addition to assessing accuracy does not result in a statistically significant difference compared to assessing accuracy alone (condition 2 vs 3). Similarly, there is no statistically significant difference in the mean sharing of true and false content across the reasoning format conditions (condition 3 vs 4).
Fig. 19. Share rate of headlines across study conditions, including spam datapoints, for headlines that were perceived as true or false. In condition 3, where participants were asked about their rationales, the share mean for headlines perceived as true decreased compared to condition 2 (condition 2 vs 3). However, asking people about their rationales via a checkbox increases sharing of content that they initially perceived as false (condition 3 vs 4).
Table 7. The headlines used in the user study of behavioral nudges. Partisanship is on a scale of -2 (more favorable for Democrats) to 2 (more favorable for Republicans).
Topic            Headline                                                                              Veracity   Partisanship
Politics         Asylum-Seekers Can Apply at U.S. Embassies Abroad                                     False      …
Science & Tech   A Scientist Was Jailed After Discovering a Deadly Virus Delivered Through Vaccines    False      …
Science & Tech   A Scientific Study Proved That "Conspiracists" Are "The Most Sane of All"             False      …
Science & Tech   Abortions are Linked to an Increased Risk of Breast Cancer                            False      …
Science & Tech   A 17-Year-Old Eagle Scout Built a Nuclear Reactor in His Mom's Backyard               True       0.04
Science & Tech   A Study Showed That Dogs Exhibit Jealousy                                             True       …