Investigating Differences in Crowdsourced News Credibility Assessment: Raters, Tasks, and Expert Criteria
Md Momen Bhuiyan, Amy X. Zhang, Connie Moon Sehat, Tanushree Mitra
MD MOMEN BHUIYAN, Virginia Tech
AMY X. ZHANG, University of Washington
CONNIE MOON SEHAT, Hacks/Hackers
TANUSHREE MITRA∗, University of Washington

∗This work was conducted while the author was at Virginia Tech.

Misinformation about critical issues such as climate change and vaccine safety is oftentimes amplified on online social and search platforms. The crowdsourcing of content credibility assessment by laypeople has been proposed as one strategy to combat misinformation by attempting to replicate the assessments of experts at scale. In this work, we investigate news credibility assessments by crowds versus experts to understand when and how ratings between them differ. We gather a dataset of over 4,000 credibility assessments taken from 2 crowd groups—journalism students and Upwork workers—as well as 2 expert groups—journalists and scientists—on a varied set of 50 news articles related to climate science, a topic with widespread disconnect between public opinion and expert consensus. Examining the ratings, we find differences in performance due to the makeup of the crowd, such as rater demographics and political leaning, as well as the scope of the tasks that the crowd is assigned to rate, such as the genre of the article and partisanship of the publication. Finally, we find differences between expert assessments due to differing expert criteria that journalism versus science experts use—differences that may contribute to crowd discrepancies, but that also suggest a way to reduce the gap by designing crowd tasks tailored to specific expert criteria. From these findings, we outline future research directions to better design crowd processes that are tailored to specific crowds and types of content.

CCS Concepts: • Human-centered computing → Human computer interaction (HCI); Empirical studies in HCI.

Keywords: misinformation, crowdsourcing, credibility, news, expert
ACM Reference Format:
Md Momen Bhuiyan, Amy X. Zhang, Connie Moon Sehat, and Tanushree Mitra. 2020. Investigating Differences in Crowdsourced News Credibility Assessment: Raters, Tasks, and Expert Criteria. In Proceedings of the ACM on Human-Computer Interaction, Vol. 4, CSCW2, Article 93 (October 2020). ACM, New York, NY. 26 pages. https://doi.org/10.1145/3415164
A misinformed citizenry, when it comes to critical issues impacting public health and public policy such as climate change and vaccine safety, can lead to dangerous consequences. As misinformation proliferates online, social and search platforms have sought effective mechanisms for tackling harmful misinformation [60]. One strategy that many online platforms have deployed is partnerships with expert groups to judge the credibility of articles posted on their platform. Initiatives include Facebook's fact-checking program, which employs third-party groups such as Climate Feedback's community of science experts [63] to rate articles that then receive a warning label or are down-ranked in users' feeds [18]. However, expert feedback is hard to scale, given the relatively small number of professional fact-checkers and domain experts. Thus, in recent years, platforms and third-party organizations have developed tools and processes to relax the expertise criteria needed to judge the credibility of news articles. Initiatives that have pursued a low-barrier crowdsourced approach to fact-checking include TruthSquad [21], FactcheckEU [19], and WikiTribune [51]. However, these prior attempts have had their own issues with maintaining high quality at scale, due to crowdsourced content requiring additional input from a relatively small number of editors [4]. As a result, most crowdsourced approaches still do not scale well, due to needing final judgments by experts or only using crowds for secondary tasks while primary research is still delegated to experts [4]. Despite the interest in scaling up news credibility assessment, there is still a great deal that is unknown about when and how crowd credibility assessments align with experts.

How can we better understand the considerations to take into account when developing scalable crowdsourced processes for news credibility assessment? In this work, we investigate three components of crowdsourcing news credibility in particular: how crowd alignment with experts changes with regards to the background and identity of the crowd, the scope of the task in terms of the news content being assessed, and the type of expert criteria being measured against.
We gather a large dataset of 4,050 news credibility ratings, spanning 4 types of raters (2 crowd groups and 2 expert groups) and 81 individuals in total, on 50 articles in the domain of climate science—an area with widespread disconnect between public opinion and scientific consensus. Through a focus on climate science, a field in which strong expert consensus exists, we provide a more consistent basis upon which to hypothesize efforts to crowdsource other topics overall, including those with less expert consensus (e.g., emerging knowledge of COVID-19, politics) [1]. The crowd raters we compare include journalism students and Upwork workers. We contrast news credibility ratings from these two groups with ratings by experienced journalists and climate scientists. All data collected for this work has been released publicly, with individual identities anonymized (Data: https://data.world/credibilitycoalition/credibility-factors2020).

We find that about 15 ratings from either the journalism students or the Upwork workers are needed in order to achieve 0.9 correlation with journalism experts. However, when it comes to science experts, 15 ratings from either crowd group only result in 0.7 correlation with scientists. Overall, we find little difference between our two crowd groups in terms of correlation to experts. But when we examine across crowd groups to consider how the personal traits of age, gender, educational background, and political leaning alter ratings, we find that raters with less education and those who were not Democrat have higher disagreement with experts.

Besides differences due to the makeup of the crowd, we additionally determine differences in credibility ratings by the kind of content being evaluated in the requested task. When we break down how different groups' ratings differ according to characteristics of the article, such as its genre and the partisanship of its publication, we find that crowd groups have stronger correlation with experts on opinion articles and articles from more left-leaning publications.

Finally, our analyses uncover differences in the criteria used to determine credibility between our two expert groups on their news credibility ratings. These differences flow to crowds—as science and journalism experts disagree more on a piece of news, crowd raters disagree more as well. In order to understand why experts disagree, we gather 147 open-ended explanations by our experts regarding the criteria they used to make their ratings. We find that science experts put emphasis on criteria related to accuracy, evidence, and grounding presented in the article, while journalism experts stress publication reputation. This difference may explain why crowds have greater correlation with journalists than with scientists.

The differences in expert criteria of what constitutes credibility, along with our findings on differences in crowd performance based on background and article type, suggest a future line of work to design crowdsourced news credibility processes that are tailored towards particular types of expertise. Instead of broadly rating credibility, different crowd rating tasks might align with different experts.
In the case of straight reporting of climate-related conferences or events, for example, one might ask crowds to align more with signals used by journalism experts, whereas reporting on scientific conclusions might align more with signals tied to science expertise. We discuss this possibility and present some preliminary findings in our Discussion. At a high level, our results suggest two strategies—person-oriented and process-oriented—to improve task design: respectively, filtering on rater background during recruitment, and training devised towards reducing particular differences. By incorporating diverse expert criteria and task fitness into these strategies, future designers may improve the reliability of crowdsourced news credibility. Altogether, our work offers a deeper understanding of the conditions under which crowdsourced annotations might serve as a proxy for different forms of reliable expert knowledge.
Credibility is often defined as a multi-dimensional construct comprising believability [23], fairness [26], reliability [59], quality [66], trust [32], accuracy [22], objectivity/bias [15, 44] and "dozens of other concepts and combination thereof" [30]. Compared to other works, credibility has been defined by Flanagin and Metzger as made up of two primary dimensions: trustworthiness and expertise [20]. Oftentimes, credibility is targeted at just the message and/or the source, while some extend it to consider context, such as the channel or medium where the message is published [36, 42]. However, research has also shown that receivers often do not distinguish between message source and the medium [12]. Furthermore, scholars from information science to cognitive psychology can range in their definition of credibility as a purely objective assessment or a subjective judgment by the information receiver, adding complexity to the primary dimensions [20, 22, 57]. Despite significant scholarly work in multi-disciplinary domains, the definition of credibility and its measurement still lacks a unified strategy [30]. Consequently, in this work, we approach credibility as a blend of subjective and objective assessments of the "message," in this case, the news article.
Though much has been made about the "wisdom of crowds," it is still unclear whether crowdsourcing can be an effective strategy for assessing news credibility and misinformation in a reliable and systematic way. Partly this has to do with the limits of crowds on certain topics. It is accepted that collective wisdom can be better than an individual's judgment, including those of individual experts [69]. These conclusions are based upon mathematical principles, which however also indicate the converse—that in certain circumstances, the collective can perform a great deal worse. One circumstance is when crowds do not have enough relevant information, suggesting that a baseline expertise is necessary [3, 68]. Crowds may also make mistakes due to an incorrect general perception about whether a piece of information is false or true [3]. Other characteristics of the crowd, such as its diversity, size, and suitability towards the task in question, also play a part [40, 47, 71, 73]. Given this prior work, the question we consider then is not whether crowdsourcing is a viable approach for news credibility assessment but instead under which conditions we can unlock the "wisdom of select crowds" [39].
Prior literature suggests that some segments of the population are potentially worse at assessing news. For example, research has found that conservative-leaning, older, and highly politically-engaged individuals are more likely to interact with "fake news" in the U.S. [28, 41, 72]. In addition, strong analytical thinking is associated with increased capacity to discern true headlines from false or hyperpartisan ones [58]. Certain topics can be polarizing for audiences, leading to poor alignment with experts for portions of the public with a particular political leaning, such as in the case of climate science [25]. Yet other prior work shows that laypeople even in polarized contexts are able to discern high quality content from low quality content [53] and are overall highly correlated with ratings from professional checkers [16]. Research has also found that homogenous groups of people can help increase accuracy while reducing polarization—strengthening the case for crowdsourced ratings [6]—an aspect we delve into while focusing on credibility assessment of news articles pertaining to climate science, a highly polarized topic among non-experts.
Though crowds' performance may vary depending on demography, their performance can also depend on what task is being asked of them. For example, researchers have encountered differences when the public is asked to fact-check versus assess media trustworthiness [4, 54, 61]. Because crowdsourced fact-checking continues to prove challenging, a subjective rating task like trustworthiness might be far less complex and better suited to crowds than fact-checking [4]. In fact, due to this difficulty in fact-checking, research shows that some topics (e.g., economy and politics) have a higher probability of getting asked to be checked than others (e.g., education and environment) [29]. There may also be differences when it comes to the unit of content analysis: claims, tweets, articles, and sources [11, 46, 52]. Additionally, the subject area of news coverage may make a difference; some topics may be easier to understand, such as events versus specialized science or health news. Research has also found that most Americans do only slightly better than chance at distinguishing factual from opinion news statements [45], and half are unfamiliar with the term "op-ed" [24]. This is concerning, as opinion pieces have different journalistic standards compared to news articles. Finally, as mentioned previously, readers' political biases may also play into their assessment of a piece of content [43]. This is why, in order to assess these content-level constraints, we analyze the performance of crowds on articles divided by genre and the political leaning of the article's source.
Finally, little is known about how different experts make use of the information embedded in news content in their credibility judgments. That is, many crowd assessments measure a crowd's alignment with a body of experts from a single domain, but multiple forms of expertise can be in scope in terms of news credibility—in our case, scientific and journalistic. Thus, there might be different criteria against which an approach at scale may wish to align. For example, while examining how finance and health experts rank websites in their respective fields, scholars found some innate differences in the respective domains (e.g., the nature of information in one domain being "proven" versus another being "predicted"), as well as experts' behavioral differences in perceiving website characteristics (e.g., differences in emphasis on visual characteristics) [64]. While one might try to control for such intra-domain differences among experts by careful selection of the topic (e.g., where the majority of the experts agree, such as in climate science [1]), our understanding of how different domain experts would judge the same piece of news content is still limited. We fill this gap by examining the different criteria used by domain experts—in our case, environmental scientists and journalists—when it comes to credibility.
Overall, the assumption that a relationship between crowds and experts can be established in a meaningful way at scale underlies many approaches in the field, and it is the approach to this relationship that this study seeks to complicate.
In this work, we conduct an investigation into three major considerations for crowdsourcing news credibility at scale. Based on the literature thus far, we expect that the crowd and subject area experts will perceive the credibility of news information differently. To systematically and empirically understand this difference, we consider the following dimensions:
• Differences in ratings might reside in the raters, as some raters are likely to be more in alignment with expert judgment. Aspects about the background of these raters could perhaps help select suitable raters.
• Other differences might reside in the task they are given—in this case, the articles they are assigned to assess, as news articles can vary along several spectra. For example, raters and experts may differ in noteworthy ways as they evaluate opinion pieces as opposed to "straight" news, or articles that have perceptible political lean.
• Finally, differences might reside in what criteria are used to judge credibility in news stories. If experts are using different criteria to determine credibility, some of them may be more or less accessible to or mirrored by crowd raters.
In order to understand these potential differences, we ask the following research questions:
• RQ1: How do crowd raters compare with experts when it comes to news credibility assessments?
• RQ2: How do personal characteristics of age, education, gender, and political leaning affect credibility ratings from the crowd?
• RQ3: How do characteristics of news articles, such as article genre (news, opinion, analysis) and political lean of the publication, affect credibility ratings?
• RQ4: How do experts in science versus journalism differ in the criteria they use to assess news credibility?
The first RQ confirms the initial assumption that experts and crowd raters disagree. However, the differences are not uniform—instead, we see that crowd raters tend to agree with journalism experts more, and that as experts disagree more, crowd raters disagree more as well. Surprisingly, we find no major differences between the two populations we recruited from—journalism students and Upwork workers. We further explore in RQ2 the suitability of different segments of the crowd for assessing news credibility. We do find differences across the board based on educational background and political leaning. RQ3 then focuses on the nature of task suitability for crowd raters according to the characteristics of articles, finding that crowds correlate more with experts when it comes to opinion articles and left-leaning publications. Interestingly, journalism and science experts also correlate more closely with each other in those cases, while having greater disagreements when it comes to the news and analysis genres and center-leaning publications.
Some of these differences can be illuminated by RQ4, which delves into the criteria that different experts use to assess news credibility. We find that science experts focus more on the evidence presented in the article and the underlying accuracy of the claims, while journalism experts focus on the publication reputation of the news outlet and overall professional standards. Indeed, for some types of articles, such as ones that report on a press conference, the criteria used by journalists may be more relevant, while for other types of articles, such as ones that report on a new scientific finding, scientists' criteria may be preferred.
These differences may also explain why crowds align more with journalists, as the criteria they use may be easier for crowds to assess. We conclude with a discussion of how to design news credibility assessment tasks that are tailored to specific crowds and contents.

Fig. 1. User distribution by gender, political party, age, and education for our crowd groups. For education, "HS&SColl", "4YColl&Comm", and "Prof" stand respectively for "High School & Some College No Degree & Some College", "4 Year College & Community College/Vocational Training", and "Professional & Graduate Degree".
We wished to isolate differences in crowdsourcing based in the raters, tasks, or between disciplinary fields themselves, as opposed to disagreements due to lack of internal consensus among subject area experts on the underlying facts. For this reason, we chose to focus on scientific topics with a high degree of consensus among domain experts, as opposed to political topics in which the potential for stable ground truth is much more challenging. We also needed a subject matter that generates enough examples of news and in which misinformation or problematic information appear regularly, as these are the conditions under which major platforms are operating when surfacing articles to fact-checkers.

Thus, we selected 50 articles focusing on climate and environment issues, a topic that has a high degree of consensus among science domain experts but that has also become politicized. To gather articles, we began with the Buzzsumo social media research tool in late 2018 to find the most popular English-language articles over the previous year with the keywords of "climate change," "global warming," "environment," and "pollution." Then, among the top results, we selected a set of articles with varying amounts of scientific reference. We also sought to diversify the number of outlets publishing the articles. In addition, we sought to include a range of liberal to conservative positions or attitudes towards climate problems in the article selection (see Appendix B for our article distribution across sources).

We expect a certain amount of correlation between conservative positions and less credible information on climate science, based on past studies, that may not generalize to other topics. But by conducting a deeper exploration of a single domain, we gain richer evidence upon which we can make inferences regarding the reasoning behind certain differences in ratings. This allows our study to consider implications for design more broadly across the dimensions of raters, tasks, and expert criteria despite being grounded within a single domain.

We collected credibility ratings on articles from four different groups, including two crowd groups consisting of: 1) 49 participants recruited from journalism and media schools, as well as 2) 26 Upwork crowdworkers, and two expert groups comprising: 3) three climate scientists, and 4) three journalists. Each crowd and expert rater rated all 50 articles in our dataset. Demographic information for the crowd groups can be found in Figure 1.
Students: The first group was canvassed through the Credibility Coalition network (https://credibilitycoalition.org), which has worked directly with nonprofits and journalism schools to build up a cohort motivated to combat misinformation. They are predominantly pursuing higher education in the U.S. and tend to be politically liberal. The Credibility Coalition actively recruited, e.g., with campus Republican clubs, to achieve more demographic balance for the study.

Upwork: In addition, we also used the Upwork platform for freelancers to gather from a more general population. For this study, we restricted participants to the U.S. Participants were admitted on a first-come basis until demographic balance became an issue (i.e., politically liberal respondents were declined once more conservatives were needed for balance).

With regards to experts, despite the realistic challenge of recruiting people with subject area expertise to participate due to their other obligations, we nevertheless sought more than a single expert's input to be able to capture how experts differ amongst themselves. In total, we recruited three experts each for the two types of expertise represented in this study.
Scientists: Three science experts were directly referred to us by contacts at major science organizations, including Climate Feedback, AAAS, and the National Academy of Sciences. Two of our experts are male, one is female. All three of our experts possess a Ph.D. in a climate-related field: two related to oceanography and atmospheric science, and one that intersects environment and economics.
Journalists: Our three journalism experts, reached through personal networks, each possess at least seven years of professional journalism experience in the U.S. Professional experience means that they received compensation for full-time positions within the journalism industry as writers, editors, and reporters of stories. Two of the experts are male, one is female. Two of our experts worked for major national newspapers while one worked for major broadcast news networks. To clarify the difference between expert fields, our news experts were not science journalists. It is worth noting that science articles can be written by non-science journalists, especially amid the downsizing of news departments and as seen with sports desk writers who have recently been reassigned to coronavirus beats [27, 38]. In addition, the relationship between science experts and news professionals need not always be harmonious: sometimes journalists provide a needed function of accountability and transparency outside of the scientific profession [7].
The approach for this study kept the challenge of large-scale information assessment in mind. For this reason, the questionnaire was designed to be short, in order for raters to be able to assess many articles. Before participation, crowd raters filled out a demographic survey. We also required crowd raters to commit to an Annotator Code of Conduct provided in their informed consent, which included performing their duties in as accurate and diligent a manner as possible, and avoiding conflicts of interest.

All raters, crowd and expert, received reading and rating tasks as shown in Figure 2, using an annotation platform called Check (https://meedan.com/check). For each article, all raters were asked to read the article and provide their perception of the article's credibility on a 5-point Likert scale, ranging from very low (1) to very high (5). Crowd raters completed all tasks across a 7–10 day period (estimated at 10 hours total) with a recommended limit of 10–15 minutes per article. After completing 50 articles on time, they received the full payment of $150.

Fig. 2. An image of the questionnaire in the Check annotation tool.
In addition to providing ratings, all six expert raters optionally provided an open-ended rationale for their credibility rating for each article, resulting in 147 rationales out of a possible 300 across all experts and articles. After the completion of 50 articles, expert raters received a payment of $300. Finally, we asked only the journalism experts to additionally classify each article across three categories: News, Opinion, and Analysis (understood as a close examination of a complex news event by a specialist [62]). We consulted journalism experts while developing these three categories, along with the ability to select Not Sure. This would allow us to better understand the potential for genre-related differences in our analysis. Of the articles, 48 had a majority genre applied by the three experts, with 32 classified as News, 8 as Opinion, and 8 as Analysis.

Much of our analysis includes inter-rater reliability, correlation between groups, and a series of regressions. Throughout, we used Krippendorff's alpha for inter-rater reliability, which is appropriate for differing data types including ordinal, nominal, and interval. For correlation analysis, we used Spearman's rank correlation—a nonparametric measure of the strength and direction of association between two variables. To determine the required number of raters, we performed a power analysis with settings including a significance of 0.05, a large effect size of 0.5, and a power of 0.8 [9]. This resulted in a required sample size of 29. For robustness in the analysis and to account for sampling error, we calculated the correlation 100 times by bootstrapping, similar to related work [47]. Additionally, we used a general ordinary least squares (OLS) linear regression on our data. Such a regression model, despite a less-than-perfect fit compared to non-linear models, has greater interpretability.
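To make these choices concrete, the following is a minimal sketch (not the authors' released analysis code) of how the reliability and power calculations could be run in Python; the ratings matrix is hypothetical, and the use of the third-party krippendorff package is an assumption.

```python
import numpy as np
from scipy.stats import norm
import krippendorff  # third-party package, assumed available

# Hypothetical ratings matrix: rows = raters, columns = the 50 articles,
# values = 1-5 credibility scores (np.nan would mark missing ratings).
ratings = np.random.default_rng(0).integers(1, 6, size=(26, 50)).astype(float)

# Inter-rater reliability at the ordinal level, as used in the paper.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")

# Sample size for detecting a correlation of 0.5 at significance 0.05 and
# power 0.8, using the standard Fisher z approximation (one common way to
# arrive at a required sample of roughly 29).
r, sig, power = 0.5, 0.05, 0.8
n = ((norm.ppf(1 - sig / 2) + norm.ppf(power)) / np.arctanh(r)) ** 2 + 3
print(f"Required sample size: ~{n:.0f}")
```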
We begin by analyzing the credibility ratings made by our two crowd rating groups and compare their ratings with ratings made by our two expert groups. Considering all the ratings we collected from each group, Table 1 presents the inter-rater reliability (IRR) and average credibility ratings within each of our two crowd groups—Student and Upwork—and our two expert groups—Science and Journalism. Overall, we see that the experts had much higher IRR within each group than the crowd raters, with the journalists most aligned at 0.83. We also compute the correlation within each expert group, i.e., comparing one expert with the other two. Again, science experts show lower correlation (sci1 = 0.72, sci2 = 0.72, sci3 = 0.62; jour1 = 0.80, jour2 = 0.77, jour3 = 0.80). We note that our one scientist with 0.62 correlation with the other scientists comes from a social science and environmental studies background as opposed to purely environmental studies, demonstrating that specific expertise even within a field could potentially give rise to differences in credibility assessment. On average, science experts had the lowest average credibility scores while journalism experts had the highest, and the two crowd groups were in between.

Table 1. Inter-rater reliability using Krippendorff's alpha (α) within all 4 rater groups on the question of credibility across 50 articles, along with average credibility rating.

Group                N    α      Avg. Credibility Rating (Std. Dev.)
Student              49   0.44   3.49 (1.32)
Upwork               26   0.48   3.34 (1.33)
Expert[Science]      3    0.75   3.21 (1.27)
Expert[Journalism]   3    0.83   3.60 (1.42)

We also compute the correlation of credibility ratings among all combinations of groups using Spearman's ρ. Figure 3 shows the pairwise correlation between rater groups when we vary the number of raters from 1 to 25 in Student or Upwork. We randomly sample 100 times from each group and then average the result; using this strategy, no individual rater has undue weight. This approach has also been used in prior studies for reliably comparing large crowds with limited expert ratings [46]. With only 3 raters in each group of experts, we simply average them per group.

Fig. 3. Correlation of credibility ratings among all pairs in four groups: 2 crowd and 2 expert groups. In each crowd group, we sample the number of raters from 1–25. For expert groups, we take all 3 ratings. Then we compute the Spearman ρ between the mean responses from each group on all 50 articles. The plot shows average ρ after 100 resamplings.

We find that the correlation between the two expert groups is 0.77. Correlation between the two crowd groups starts off low at about 0.4 with only 1 rater, but becomes high (ρ = 0.9) with about 15 or more raters within each group. This suggests that when averaging across 15 or more raters, both rater populations begin rating about equivalently. To account for lack of demographic control between the two crowd groups, we performed a similar analysis on a matched data set shown in Appendix A. We find some minor differences, including a slight lowering of the correlation between the two crowd groups as well as between students and experts. However, results from the matched data do not contradict our findings, offering additional confidence to our overall results.
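The subsampling behind Figure 3 can be sketched roughly as follows; the `crowd` and `expert_mean` arrays below are hypothetical stand-ins for the actual ratings, and the function simply mirrors the description above (average the Spearman ρ over 100 random draws of n raters).

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical data: crowd ratings (raters x 50 articles) and the mean
# rating per article from one 3-person expert group.
crowd = rng.integers(1, 6, size=(49, 50)).astype(float)
expert_mean = rng.integers(1, 6, size=(3, 50)).mean(axis=0)

def correlation_curve(crowd, expert_mean, max_n=25, resamples=100):
    """Average Spearman rho between the mean of n sampled crowd raters and
    the expert mean, for n = 1..max_n, over `resamples` random draws."""
    curve = []
    for n in range(1, max_n + 1):
        rhos = []
        for _ in range(resamples):
            idx = rng.choice(crowd.shape[0], size=n, replace=False)
            rho, _ = spearmanr(crowd[idx].mean(axis=0), expert_mean)
            rhos.append(rho)
        curve.append(float(np.mean(rhos)))
    return curve

curve = correlation_curve(crowd, expert_mean)
print(f"Average rho with 15 raters: {curve[14]:.2f}")
```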
Fig. 4. Changes in standard deviation of crowd groups' credibility ratings as the absolute distance between the two expert groups' average credibility ratings grows.

When we dive into the correlation of each crowd group to each expert group, differences emerge. First, we notice that Upwork has slightly higher correlation with both sets of experts than Student. The gap, while small in both cases, is nonetheless robust in the case of journalists (0.04, t = ..., p < .02), averaging across 1–25 raters. In the case of scientists, the gap was 0.02 (t = ..., p < ...). With 15 raters, crowd correlation with journalism experts is high (... for Student and 0.89 for Upwork). However, crowd raters get only about 0.72 (0.71 for Student and 0.73 for Upwork) correlation with scientists using 15 raters, and ratings do not improve at 25 raters. The difference between correlation with scientists versus journalists is a major one, with crowds aligning with journalists more (0.13 difference for Student and 0.15 difference for Upwork). However, recall that our analysis of correlation within individual experts shows a range between 0.6 and 0.8. Both sets of crowd raters at 15 ratings each still fall within that range in their correlation with experts.

Finally, we examine how our crowd groups' ratings change when the expert groups diverge in their ratings from each other. Figure 4 shows the plot of standard deviation of the crowd workers as the absolute difference between the average credibility ratings of the two expert groups goes up. The figure shows an almost linear upwards trend for the Upwork workers. Students also have an upward trend initially, though this trend reverses at the last point. Comparing the two crowd groups, we find medium correlation between their standard deviations (Spearman ρ = .59, p < 0.001). Unsurprisingly, as the disagreement grows between the expert groups, credibility ratings of the crowd also diverge. In RQ3 and RQ4, we examine in more detail the articles that lead to higher expert disagreement, finding that factors include the type of article and differences in expert criteria regarding credibility.
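A similar sketch, under the same hypothetical-data assumption, for the Figure 4 relationship between per-article expert disagreement and the spread of crowd ratings:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Hypothetical ratings: one crowd group (raters x articles) and the two
# expert groups (3 experts x articles each).
crowd = rng.integers(1, 6, size=(26, 50)).astype(float)
sci = rng.integers(1, 6, size=(3, 50)).astype(float)
jour = rng.integers(1, 6, size=(3, 50)).astype(float)

# Per-article expert disagreement and crowd spread, as plotted in Figure 4.
expert_gap = np.abs(sci.mean(axis=0) - jour.mean(axis=0))
crowd_std = crowd.std(axis=0)

# Do crowd ratings spread out more on articles where the expert groups diverge?
rho, p = spearmanr(expert_gap, crowd_std)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```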
Next, we examine more deeply the crowd raters and consider their demographics. To determine how crowd raters' personal characteristics, such as their age and gender, relate to how well they agreed with experts, we perform an OLS regression on the error in our crowd raters' credibility rating when compared to experts' average rating. In Tables 2 and 3, we present 6 models, where ratings from just Student, just Upwork, and then Student and Upwork combined are compared against ratings from Science and then Journalism. We re-coded crowd raters' education responses into three larger groups due to low quantities for some of the responses: combining "High School", "Some College No Degree" and "Some College" into one, and "4 Year College" with "Community College/Vocational Training" into another. We also divided raters into "18-25", "26-30", and "31+" age groups.

Table 2. OLS regression on error in credibility rating compared to science experts' average rating, after recoding and with non-significant rows omitted. The reference levels for education, gender, age, and political leaning are: Graduate degree, Female, 18-25, and Democrat. Negative coefficients with significant p-values contribute to less error, and positive significant coefficients to more error. Here, Cohen's f and adjusted R² are the effect size of each variable and each model, respectively. Conventionally, Cohen's f of 0.02, 0.15, and 0.35 are termed small, medium, and large, respectively.

Expert[Science]   Student                     Upwork                      Stud.+Upwork
                  β (sig.)  Err.  Cohen's f   β (sig.)  Err.  Cohen's f   β (sig.)  Err.  Cohen's f
Intercept         0.13*     0.06              -0.04     0.07              0.12***   0.04
Edu[4Y&CColl]     0.13***   0.04  0.03        0.05*     0.02  0.01        0.07***   0.02  0.02
Edu[HS&SColl]     0.00      0.04  0.03        0.01      0.04  0.01        -0.03     0.02  0.02
Gender[Male]      -0.02     0.01  0.01        -0.04*    0.02  0.00        -0.03***  0.01  0.01
Age[26-30]        -0.05     0.04  0.00        -0.01     0.03  0.01        -0.06***  0.02  0.00
Pol[Indep.]       0.06***   0.02  0.01        0.11***   0.02  0.03        0.06***   0.01  0.02
Pol[Other]        0.04      0.02  0.01        0.15***   0.03  0.03        0.08***   0.01  0.02
Pol[Repub.]       0.08***   0.02  0.01        0.13***   0.04  0.03        0.10***   0.01  0.02

Table 3. OLS regression on error in credibility rating compared to journalism experts' average rating, after recoding and with non-significant rows omitted. The reference levels and effect-size conventions are the same as in Table 2.

Expert[Journalism]  Student                     Upwork                      Stud.+Upwork
                    β (sig.)  Err.  Cohen's f   β (sig.)  Err.  Cohen's f   β (sig.)  Err.  Cohen's f
Intercept           0.27***   0.06              0.11      0.07              0.26***   0.04
Edu[4Y&CColl]       0.11***   0.04  0.03        0.04*     0.02  0.00        0.06***   0.02  0.02
Edu[HS&SColl]       -0.03     0.04  0.03        0.01      0.04  0.00        -0.05***  0.02  0.02
Gender[Male]        -0.03*    0.01  0.01        -0.04***  0.02  0.00        -0.03***  0.01  0.01
Age[26-30]          -0.06     0.04  0.00        -0.03     0.03  0.01        -0.06***  0.02  0.00
Pol[Indep.]         0.07***   0.02  0.02        0.12***   0.02  0.03        0.07***   0.01  0.02
Pol[Other]          0.06**    0.02  0.02        0.15***   0.03  0.03        0.08***   0.01  0.02
Pol[Repub.]         0.10***   0.02  0.02        0.14***   0.04  0.03        0.11***   0.01  0.02
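As a rough illustration of the structure of these models (hypothetical data and illustrative variable names, not the authors' actual code), an OLS on rating error with the same reference levels could be set up with statsmodels as follows:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500  # hypothetical number of (rater, article) observations

# Hypothetical long-format data: one row per rating, with the error against
# the relevant expert group's average rating and the rater's demographics.
df = pd.DataFrame({
    "error": rng.random(n),
    "edu": rng.choice(["Grad", "4Y&CColl", "HS&SColl"], n),
    "gender": rng.choice(["Female", "Male"], n),
    "age": rng.choice(["18-25", "26-30", "31+"], n),
    "pol": rng.choice(["Democrat", "Indep.", "Other", "Repub."], n),
})

# Reference levels mirror Tables 2 and 3: Graduate degree, Female, 18-25, Democrat.
model = smf.ols(
    "error ~ C(edu, Treatment('Grad')) + C(gender, Treatment('Female'))"
    " + C(age, Treatment('18-25')) + C(pol, Treatment('Democrat'))",
    data=df,
).fit()
print(model.summary())
```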
Among our variables, consistent across all models, crowd raters with a non-Democrat political leaning had higher error in their assessment (where error is alignment with the experts in the particular model). In addition, males had lower error compared to females; the difference is small but significant in all the models except one. Among age groups, people aged 26–30 had lower error compared to those aged 18–25; however, those values are only significant in the omnibus models. Other age ranges had no significant results. On the other hand, crowd raters with a four-year college or community college degree had higher error compared to those with a graduate degree. Surprisingly, raters with a high school degree or some college experience had lower error compared to those with a graduate degree in one of our models (Student + Upwork compared with Journalism). This may be because the majority of our crowd raters in the Student group are assumed to still be in college, and perform relatively well due to exposure to journalism and media studies. Thus, in addition to the aspects of potential bias due to political orientation, potentially exacerbated in the case of climate change news as we expected, we find that the issue of formal training and education is important to consider.

Table 4. IRR across article genres and political leaning of article sources. Here, the numbers in bold represent the highest IRR for each rater group across article genres/political leaning.

           Count   Student   Upwork   Expert[sci]   Expert[jour]
Opinion    8
Analysis   8
News       32      0.311     0.339    0.518         0.537
Left       6       0.251     0.304    0.236
Center     24      0.095     0.141
Right      15
In this section, we investigate specifically how the genre of an article as well as the political leaning of the publication result in differences between expert and crowd ratings. Given the difficulty that Americans have with factual and opinion statements within news articles, we first consider article genre. As explained earlier, journalism experts additionally classified the genre of articles in our dataset, applying "Opinion", "Analysis", and "News". We used a majority vote by the journalists to categorize 48 out of 50 articles into their respective genres. Across News and Opinion, the journalism experts had an IRR of 0.97; but when adding Analysis as a third category, the IRR went down to 0.71.

The second area of interest is the political leaning of the publication behind an article. Using Media Bias/Fact Check, a site that classifies media sources on a political bias spectrum and that has been used in prior research [8], we re-coded their 7 categories into three higher-level categories of left, center, and right, resulting respectively in 6, 24, and 15 articles from our dataset (5 were omitted because they had no entry in Media Bias/Fact Check). From an article source perspective, articles from both right- and left-leaning sources have higher IRR from the crowd than those in the center (see Table 4). This suggests that annotators might have used the leaning of sources as shortcuts to identify credibility, given how political lean today equates with believing in or denying climate change [67].

We examined how our crowd groups evaluated credibility in relation to experts for the two sets of article types. Using our previous approach of correlation analysis, we looked at the correlation of credibility between pairs of groups. For this comparison, instead of varying the number of crowd raters from 1–25, we sampled 25 crowd raters 100 times and averaged the resulting correlations into a single metric. To combine p-values, we report the percentage of times p was significant across the samplings [5]. The pairs of groups that show significant correlations (p < 0.05) in more than 70% of the samplings are the values of interest (see Figures 5 and 6), where 70% is a heuristic we chose to reduce clutter in the figures. However, we note that some aspects of the data have imbalance, particularly the number of experts (n=3), the number of articles in the Opinion or Analysis genres (n=8), and the number from left-leaning sources (n=6). The following results need to be considered in light of these constraints.

Separately, we wondered whether our crowd raters could label genre. When asked to consider just News versus Opinion, IRR was lower at 0.43 for Student and 0.49 for Upwork, but the majority assessment of each crowd group was 100% aligned with experts. Most articles labeled "Analysis" by journalists were labeled "News" by the crowd groups.
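A condensed sketch of the resampling-and-significance procedure described above (hypothetical arrays; one article subset such as the 8 Opinion articles), reporting both the averaged correlation and the share of samplings in which p < 0.05:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)

# Hypothetical ratings restricted to one subset, e.g., the 8 Opinion articles.
crowd = rng.integers(1, 6, size=(49, 8)).astype(float)
expert_mean = rng.integers(1, 6, size=(3, 8)).mean(axis=0)

rhos, significant = [], []
for _ in range(100):                      # 100 resamplings of 25 crowd raters
    idx = rng.choice(crowd.shape[0], size=25, replace=False)
    rho, p = spearmanr(crowd[idx].mean(axis=0), expert_mean)
    rhos.append(rho)
    significant.append(p < 0.05)

# Pairs are reported only if significant in more than 70% of the samplings.
print(f"mean rho = {np.mean(rhos):.2f}, "
      f"significant in {np.mean(significant):.0%} of samplings")
```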
Fig. 5. This figure shows the average credibility rating and standard deviation for each of the four rater groups broken down by article genre of opinion/analysis/news (on the left side), along with correlation analysis results between pairs (on the right side). The presence of a bar on the right means that a pair has significant (p<0.05) correlation of credibility in more than 70% of the crowd samplings. For the correlation analysis, we sampled crowd raters with n=25, sampling 100 times, computing correlations each time and then averaging the correlations. Note that the number of articles in some categories is skewed (Opinion = 8, Analysis = 8, and News = 32).

Fig. 6. This figure shows the average credibility rating and standard deviation for each of the four rater groups broken down by article source of left/center/right (on the left side), along with correlation analysis results between pairs (on the right side). The presence of a bar on the right means that a pair has significant (p<0.05) correlation of credibility in more than 70% of the crowd samplings. For the correlation analysis, we sampled crowd raters with n=25, sampling 100 times, computing correlations each time and averaging the correlation. Note that the number of articles in some categories is skewed (Left = 6, Center = 24, and Right = 15).
Among the different news genres, our tests suggest that crowd groups had higher correlation with both groups of experts on rating the credibility of Opinion articles. When it came to News articles, correlation of both crowds dropped with scientists but not with journalists. While scientists and journalists were somewhat correlated in the case of News, we saw differences between their average ratings, with journalists being more positive overall. This is even more pronounced in the case of Analysis, where journalists regarded these articles all relatively highly. In this genre, there was less correlation across rater groups. We explore potential reasons for this in RQ4.
Along political lines, ratings of both crowd groups had higher correlation with the experts on articles from left publications. For articles from center publications, we only saw significant correlations between crowds and journalists; meanwhile, scientists and journalists disagreed. We also saw a high average rating from journalism experts for center publications, which may come from a professional experience and training that aligns more closely with center, non-partisan sources. This possibility is also explored in greater detail in RQ4. For right-leaning publications, both expert groups gave these articles low ratings on average, as expected, with science experts providing the lowest average rating. Interestingly, while crowd groups were highly correlated with each other, they had lower correlation with experts, and experts also had lower correlation with each other.
In order to understand why science versus journalism experts differ in their credibility assessments, and how this might further illuminate crowd differences, we conducted a deep qualitative analysis of the optional, open-ended explanations experts gave for their different credibility ratings. In total, the 3 scientists gave 82 explanations across the 50 articles, while the 3 journalists gave 65 explanations. Initially, one of the authors conducted open coding across all of the explanations using a grounded theory method to develop an initial set of 38 codes of both negative and positive expert criteria [65]. All authors then discussed the codes while looking at examples of explanations, resulting in some codes being renamed and others being split apart or merged together. The authors also worked together to group the codes into high-level categories, some of which have a rough mapping onto existing principles of journalism [49]. After additional iterations of discussion and re-coding of the explanations, we arrived at the 8 high-level categories in Table 5. Each category is comprised of several lower-level criteria that are either positive or negative with regards to impact on credibility. For example, the code "accurate, based in facts[+]" under Accuracy means that an expert mentioned accuracy as a positive association with article credibility in their explanation.
Overall, we found that experts mentioned Accuracy and Publication Reputation most frequently (48 times), closely followed by Credible Evidence/Grounding (45 times) and Impartiality (44 times). However, there were differences when we compared journalists versus scientists. By far the most cited criterion for journalists was Publication Reputation (Figure 7). We saw numerous cases where the journalists would either dismiss or trust the contents of an article based on the publication's brand and reputation: "The Hill, while a crappy publication, has brand recognition that gives it more credibility. Without it the credibility ranking would be lower." Journalists were also more likely than scientists to mention criteria related to Website Aesthetic ("serial killer font") and Professionalized Practices and Standards, such as the presence or lack of structured information such as a dateline, and low writing/editing quality: "...use of exclamation marks and bad writing overall reduced credibility in my mind."

In comparison, scientists were most likely to cite issues related to Credible Evidence/Grounding, such as the presence or lack of citations, quotes from experts, or other evidence: "A partisan article...failing to include credible sources' comments on the decision." Scientists also mentioned Impartiality often, primarily to comment on neutrality of tone. Journalists mentioned impartiality frequently as well but were more likely to discuss it in terms of "both sides" coverage, in both a positive ("Credibility enhanced by links to other publications and presentation of both sides of argument/critics views...") and a negative way ("Links add to credibility. However, there is no opposing/contrarian voices in this story.").
Table 5. Qualitative codes under 8 major categories. The +/- symbols inside the brackets show their polarity on credibility. See Appendix C for example notes and their corresponding codes.

Accuracy: accurate, based in facts [+]; inaccurate representation of facts/scientific consensus [-]; misleading images [-]; misleading headline [-]; sensationalist headline [-]; hyperbolic language [-]; cherrypicking/misleading [-]

Impartiality: neutral, nonpartisan tone/lack of attacks or injected opinion [+]; balanced/both sides of debate [+]; goes against source/author's perceived bias/hurts their own cause [+]; biased language, partisan, opinionated rant without substance [-]; imbalanced/lack of both sides of debate [-]; goes along with perceived bias [-]

Completeness of Coverage: provides context/explanation [+]; thorough/in-depth as opposed to light coverage [+]; lack of context [-]; light/cursory coverage [-]

Originality and Insight: provides insight/informed implications [+]; lack of quality discussion/analysis/insight [-]; lack of original reporting [-]; poor interpretation/uninformed implications [-]

Credible Evidence/Grounding: references a credible source and/or confirmation by credible source [+]; quotes from experts [+]; cites credible scientific study [+]; includes data/charts/image evidence [+]; lack of citation [-]; lack of quotes from experts [-]; has citation but of bad science or low credibility study/source [-]; facts refuted by credible source/commonly known as debunked [-]

Publication Reputation: well-known/credible source [+]; credible/expert author [+]; low quality source [-]; unknown/non-mainstream source/brand [-]; biased source [-]

Professionalized Practices and Standards: dateline clearly marked [+]; clear article/source standards [+]; clearly labeled as opinion when it is an opinion [+]; authoritative, professional writing [+]; lack of dateline [-]; low writing/editing quality [-]; personalization of language/non-professional language [-]

Website Aesthetic: poor font choice [-]; bad page layout [-]
Fig. 7. Frequency of the categories in expert explanations for journalists versus scientists. On the left are raw counts and on the right, the counts are normalized by the number of explanations made by journalists versus scientists in total.
Finally, scientists were also more likely to cite Accuracy and would sometimes rely on their personal knowledge about the science to evaluate the article: "I study satellite imagery...A really poor study, repeatedly debunked."

We performed a series of regressions with experts' credibility rating as the outcome variable and their codes, divided into positive and negative factors, as independent variables. Table 6 shows the result of our model for the three combinations of science experts, journalism experts, and then the two sets of experts combined. We tested for multicollinearity in the data and found no evidence of it (Variance Inflation Factor < 1.2 for all factors).

Table 6. Regression on credibility using qualitative codes. Non-significant rows have been omitted.

                                              Science               Journalism            Sci + Jour
                                              β (sig.)  std. err.   β (sig.)  std. err.   β (sig.)  std. err.
Intercept                                     3.54***   (0.15)      3.66***   (0.13)      3.62***   (0.09)
Completeness of Coverage[+]                   0.75*     (0.32)      0.38      (0.39)      0.57*     (0.25)
Credible Evidence/Grounding[+]                0.57*     (0.28)      0.13      (0.42)      0.40      (0.23)
Publication Reputation[+]                     0.18      (0.67)      0.78*     (0.34)      0.59*     (0.26)
Accuracy[-]                                   -0.74***  (0.20)      -0.71     (0.57)      -0.78***  (0.20)
Impartiality[-]                               -0.81***  (0.23)      -0.71     (0.51)      -0.82***  (0.22)
Originality and Insight[-]                    -1.01*    (0.40)      -0.53     (0.54)      -0.74*    (0.32)
Credible Evidence/Grounding[-]                -0.99***  (0.26)      -0.71     (0.87)      -0.94***  (0.26)
Professionalized Practices and Standards[-]   -0.96     (0.55)      -0.89     (0.49)      -0.81*    (0.33)
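A small sketch of how the multicollinearity check and a Table 6 style regression could be set up with statsmodels; the code-indicator matrix below is hypothetical, with only a few illustrative categories.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)

# Hypothetical design matrix: one row per expert rating, with binary
# indicators for positive/negative mentions of each criteria category.
X = pd.DataFrame(
    rng.integers(0, 2, size=(150, 4)),
    columns=["Accuracy[-]", "CredibleEvidence[+]",
             "CredibleEvidence[-]", "PublicationReputation[+]"],
).astype(float)
y = rng.integers(1, 6, size=150).astype(float)  # expert credibility ratings

# Multicollinearity check: one VIF per factor (the paper reports VIF < 1.2).
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, np.round(vifs, 2))))

# OLS of credibility on the positive/negative code indicators.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)
```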
Fig. 8. Count of occurrences of the codes, normalized by the number of articles, for articles with differing absolute distance between science and journalism experts' average credibility ratings [abs(avg(sci) - avg(jour))]. Panels: Accuracy, Impartiality, Completeness of Coverage, Originality and Insight, Credible Evidence/Grounding, Publication Reputation, Professionalized Practices and Standards, and Website Aesthetic.
Publication Reputation , primarily as positive evidence. However, for science experts, amongthe different criteria that could increase or decrease their credibility perception, only
CredibleEvidence/Grounding had both significant positive and negative impact. The remaining categoriesonly boosted their perception of credibility (
Completeness of Coverage ) or only negativelyinfluenced it (
Accuracy , Impartiality , Originality and Insight ). InFigure 8, we show how often a particular criteria is provided by scientists versus journalists asthe absolute difference between their average ratings for an article increases. We can see that asdisagreements between scientists and journalists grow, their rationales diverge, with scientistsciting
Accuracy more, and journalists citing
Publication Reputation and
ProfessionalizedPractices and Standards more.We inspected some examples of articles with high absolute differences in ratings between theexperts to illustrate how these differences emerged. For instance, in one case, scientists rated anarticle by the Daily Wire, an outlet considered to have a “right bias with mixed factual reporting”according to the site Media Bias/Fact Check, as considerably more credible than journalists did
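To make the counting behind a plot like Figure 8 concrete, here is a minimal sketch assuming a tidy table with one row per coded expert explanation; the column names are our assumptions, and the bin edges simply mirror the figure’s axis ticks.

```python
import pandas as pd

# Hypothetical tidy frame: one row per coded explanation.
# Columns: article_id, expert_group ("sci" or "jour"), code, credibility (1-5).
codes = pd.read_csv("expert_explanation_codes.csv")  # assumed file layout

# Absolute gap between the two expert groups' mean credibility per article.
avg = codes.groupby(["article_id", "expert_group"])["credibility"].mean().unstack()
gap = (avg["sci"] - avg["jour"]).abs().rename("gap")

# Bin articles by that gap, then count how often each code appears per group,
# normalized by the number of articles falling in the bin.
binned = codes.join(gap, on="article_id")
binned["gap_bin"] = pd.cut(binned["gap"],
                           bins=[0, 0.33, 0.67, 1.0, 1.33, 1.67, 2.0],
                           include_lowest=True)
articles_per_bin = binned.groupby("gap_bin")["article_id"].nunique()
counts = (binned.groupby(["gap_bin", "expert_group", "code"]).size()
                .div(articles_per_bin, level="gap_bin"))
print(counts.unstack("expert_group"))
```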
We inspected some examples of articles with high absolute differences in ratings between the experts to illustrate how these differences emerged. For instance, in one case, scientists rated an article by the Daily Wire, an outlet considered to have a “right bias with mixed factual reporting” according to the site Media Bias/Fact Check, as considerably more credible than journalists did (a difference of 1.67 in average ratings). The article was reporting on an academic publication, leading one scientist to write “reasonable reporting on a study that has some issues with reaching claims” and to give it a 3 out of 5. Journalists were considerably harsher, taking the article to task for issues such as a lack of Originality and Insight and Professionalized Practices and Standards: “...it’s a news story that cites a study but has no real original or live onsite reporting. Lack of deadline undermines credibility.” They also mentioned Publication Reputation, with one person stating the article’s credibility was “undermined by association [with] previous content deemed not credible” on the site.

In another case, we saw journalists this time giving an article by BBC News a higher rating (5 out of 5 across the board), while scientists all gave the article a 3 out of 5. Unsurprisingly, journalists mentioned Publication Reputation, with one person saying that credibility was “...enhanced by association with BBC brand.” However, scientists found issues with Accuracy, calling out the piece for misleading images and a misleading headline: “the title including the word ‘hothouse’ can be misleading as it suggests a runaway global warming, which is not possible on Earth.”

These examples point to the shortcuts that journalists sometimes employ by focusing on an article’s publication or on more superficial elements of style and presentation, as opposed to the contents of the article. This may be necessary in cases when they cannot easily consult the underlying scientific source and do not have access to the deep domain knowledge that scientists can draw upon. This may be why we saw journalists giving uniformly high ratings to center-leaning publications in RQ3. It may also explain why crowd raters tended to agree with journalists more.

Finally, we noticed that a few major differences in ratings stemmed from differences in interpreting the genres of news articles. In several instances, we saw scientists giving lower scores to articles that would be considered “straight news”, that is, news articles that concisely and impartially report facts about an event, while journalists gave them a 5. For example, in one article labeled News by the journalists, where there was a difference of 1.3 between scientist and journalist ratings, a scientist gave the following rationale for their rating of 3: “Neutral account of incident, no insight provided.” This may be why we see scientists invoking Completeness of Coverage at a higher rate than journalists, as journalists may perceive a concise article without in-depth coverage as a valid piece of journalism. This could also explain why journalists overall gave higher ratings than scientists for the genres of analysis and news in RQ3.
This study investigated several sources of difference between the layperson assessment of news credibility and that of experts in science and journalism, all towards the goal of informing crowdsourced processes for news credibility assessment at scale. RQ1 affirmed that crowds do not always agree with experts, and that experts do not always agree amongst themselves. If the goal is to align crowds to experts, we find that it takes about 15 crowd raters to achieve high correlation, after which the correlation begins to plateau. However, this number might be reduced if we tailor crowds and tasks, given our findings in RQ2 and RQ3.
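The “about 15 raters” figure comes from a subsampling analysis of the kind summarized in the Figure 11 caption: sample k raters, average their ratings per article, and correlate that mean with the expert mean, repeating the draw many times. A minimal sketch, assuming a raters-by-articles ratings matrix; the function and variable names are ours, not the paper’s.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def correlation_curve(crowd, expert_mean, max_k=25, n_resamples=100):
    """crowd: (n_raters, n_articles) array of 1-5 ratings (NaN where missing);
    expert_mean: (n_articles,) array of mean expert credibility ratings.
    Returns the average Spearman rho for k = 1..max_k sampled raters."""
    n_raters = crowd.shape[0]
    curve = []
    for k in range(1, max_k + 1):
        rhos = []
        for _ in range(n_resamples):
            idx = rng.choice(n_raters, size=k, replace=False)
            crowd_mean = np.nanmean(crowd[idx], axis=0)
            rho, _ = spearmanr(crowd_mean, expert_mean, nan_policy="omit")
            rhos.append(rho)
        curve.append(float(np.mean(rhos)))
    return curve  # inspect where the curve plateaus (around k = 15 here)
```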
Interestingly, we find that the Upwork crowd has a slightly higher correlation with experts than the Student group, many of whom presumably took coursework in media literacy or journalism. However, while Upworkers have a more varied demography than our Student group, they also likely have high rates of digital literacy as online freelancers [48]. When we examine demographics more carefully in RQ2, we find that Democrats, males, raters aged 26–30, and people with higher education levels across both crowd groups have greater alignment with experts. However, some of these results are likely specific to the topic, given that the Republican platform currently questions climate change. For other factors, such as gender, it may simply not be desirable to recruit a biased representation.
Delving into article types was the focus of RQ3, which laid some groundwork for task suitability. When it comes to genre, both groups of crowd raters were more correlated with experts on opinion articles. Along political lines, crowd groups were more correlated with experts on articles from left-leaning sources. These results suggest that the crowd may be able to replace experts’ annotations for certain article types but not others. In addition, some difficulties for raters may arise from the lack of visual cues such as genre labeling in U.S. mainstream media [33]. When genre is not labeled or well understood, readers might need to rely on structural aspects of the article to classify it, which is hard when the style is difficult to interpret; indeed, even experts cannot always agree. Finally, given our findings in RQ4, some news articles that conduct original research or report on new scientific findings might require subject matter experts who can assess accuracy.
While we cannot expect crowds to always be capable of evaluating Accuracy, results from RQ4 pointed to more attainable ways of evaluating news articles that experts also use, namely the inclusion of Credible Evidence/Grounding used by scientists and Publication Reputation used by journalists. Though over-emphasis on Publication Reputation by journalism experts may seem to be a red flag, it is a way for non-experts in a domain such as climate science to initiate their investigation of credibility, much as scientists have preconceptions about the work from certain scholarly journals over others.

Given our findings that domain experts use different criteria to judge credibility and that these differences may surface among crowds, a future line of work could seek to reduce crowd disagreement, both within the crowd and with certain experts, by aligning to a particular set of domain-relevant criteria. For example, one might ask the crowd to label specific components of an article that may signal credibility, rather than broadly asking about credibility itself. This forces raters to focus on aspects such as Publication Reputation or Credible Evidence/Grounding that align with expert assessments, as opposed to allowing raters to reduce the broad credibility question to a scale along a dimension of their own choosing or instinct.

Indeed, prior research has shown that crowds perform well at assessing publication reputation [53], and there exists a wide set of such source and message characteristics, or signals of potential trust from a reader’s perspective, ranging from organization standards to the reader’s capacity for engagement [13, 17, 22, 42, 55, 56]. Other work has examined features directly in the article that may signal credibility, including title structure and proper nouns [31], article content (e.g., emotional tone) or context (e.g., citation of reputable sources) [74], as well as secondary characteristics (e.g., source attractiveness [50]). Even in the news credibility context, research indicates that crowd and journalist evaluations of information accuracy differ in their incorporation of signals [10]. If a complex construct like credibility can be distilled into a cluster of simpler signals, such as the perception of emotion in an article’s title, and further designed to be in alignment with expert judgments, annotators may prove far more reliable at completing those tasks than at more complicated assessments. This sub-task strategy is familiar in complex crowdsourcing systems [37].

To pilot this concept, we applied this reasoning to our own study, which asked the crowd groups to additionally label four credibility signals that had previously been identified as potential indicators of expert credibility [74] on all 50 articles. Again, because our scope was large-scale credibility assessment, we also looked for quickly answerable questions. The questionnaire included three signals related to the title of the article: the degree to which it is “clickbait”, its level of emotion, and the representativeness of the title in comparison to the rest of the article. The fourth asked about
the level of emotion in the “dek”, or short summary of the article beneath the title (see Appendix D for how we defined these terms for the crowds). Seen through the work of RQ4, the emotion signals relate to our expert category of Impartiality, while our representativeness of title and clickbait title correspond to our expert categories of Accuracy and Professionalized Practices and Standards.

Fig. 9. Correlation between article signals and the 2 expert groups’ credibility ratings: (a) science experts, (b) journalism experts. [Lines: Student and Upwork crowds for Title Clickbaitness, Title Representativeness, Title Emotion, and Dek Emotion; x-axis: no. of users; y-axis: correlation between annotation and expert credibility.]

Fig. 10. Correlation between article signals and journalism expert credibility rating for opinion/analysis/news articles. [Same signals and crowd groups as Figure 9, broken out by genre.]

We report that the signals we tested overall resulted in poor to moderate correlations with expert credibility assessments. We found that crowd groups’ title representativeness scores had the highest alignment with experts’ credibility ratings (see Figures 9(a) and 9(b)), yet had the lowest IRR between raters (0.16 for
Student and 0.19 for Upwork). In contrast, both of the emotion questions had low correlation with experts’ credibility ratings but had the highest IRR (e.g., for title emotion, IRR was 0.47 for Student and 0.52 for Upwork). We also saw that overall, Upwork workers have higher correlation with both groups of experts than Students do, except for title representativeness against the journalism experts.

But now, taking our qualitative categories and subcategories based upon expert rationales, we might frame the question differently, seeking “misleading headline” or “sensationalist headline” under the Accuracy category, in addition to giving preference to other categories altogether. Another strategy may be to combine our insights regarding article type with signals for credibility. Figure 10 shows the correlation between journalists’ credibility ratings and crowd ratings on credibility signals, broken down across three article genres. Crowd workers have higher correlation with journalism experts on all our signals for
Opinion and Analysis articles, in contrast to News. This pattern suggests that some of the signals can be more useful in particular cases (e.g., title representativeness has a correlation as high as 0.6 for Opinion and Analysis articles, while title clickbaitness is the most correlated signal for the News genre). More work is needed to find credibility signals that align well with expert criteria and that are also stable among some subset of the crowd before this approach can be practically used in production crowdsourced processes.
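One way to reproduce this kind of genre-level comparison is to average each signal across crowd raters per article and correlate that mean with the experts’ mean credibility within each genre; the sketch below assumes a tidy ratings table with illustrative column names, not the study’s actual schema.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical tidy frame: one row per crowd rating of one article.
# Columns: article_id, genre ("Opinion"/"Analysis"/"News"), the four signal
# scores, and expert_cred (the mean journalism-expert credibility rating).
df = pd.read_csv("crowd_signal_ratings.csv")  # assumed file layout
signals = ["clickbait", "representativeness", "title_emotion", "dek_emotion"]

per_article = df.groupby(["genre", "article_id"]).agg(
    expert_cred=("expert_cred", "first"),
    **{s: (s, "mean") for s in signals},
)

for genre, block in per_article.groupby(level="genre"):
    for s in signals:
        rho, p = spearmanr(block[s], block["expert_cred"])
        print(f"{genre:10s} {s:20s} rho={rho:+.2f} (p={p:.3f})")
```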
Our current work has implications for designing processes for crowdsourcing news credibility. We summarize them below.

Designers have the opportunity to control the participants involved in crowdsourcing at two levels: demographic filtering during recruitment, and training for secondary improvement. In other words, our results imply that a combination of person-oriented strategies (e.g., filtering by demographics), followed by process-centric strategies (e.g., training raters by emphasizing what signals they should consider), can facilitate high-quality, at-scale credibility assessment. These results are in line with prior work pointing to the advantages of person- and process-centric strategies for crowdsourcing qualitative coding [47], which includes tasks that are often quite subjective in nature and, thus, prone to conflicting interpretations. Our approach to the credibility of news articles is indeed a blend of subjective and objective assessments. Based on the performance of the two crowd groups with differing demographic backgrounds, our findings also suggest that 15 ratings provide enough stability in the result. However, the difference in errors based on background suggests that recruiters can employ certain person-centric filtering mechanisms to enforce specific criteria in their systems. For example, filtering out certain education levels may serve some purpose for the system designers. At the same time, designers should be aware of how such a filtering mechanism may bias the system. Additionally, the criteria used by the expert groups (demonstrated in our RQ4 results) could serve as training for the crowd, offering a host of process-oriented tactics for designers to employ with their crowd-rater workforce. For example, questionnaires can be devised to identify a baseline of crowd raters’ expertise in credibility evaluation. Based on that expertise, different training mechanisms can be targeted at each group to address its deficiencies (e.g., literacy programs to improve the identification of accuracy or impartiality issues). This training should not be limited to understanding only the principles of journalism; rather, it could show how subject matter experts identify and distill reliable evidence.
Given the differences in evaluation criteria and the corresponding credibility ratings, designers have to consider which group of experts they want to emulate in the system. This consideration is in effect throughout the design process. For instance, to emulate behavior close to the journalism experts, system developers may employ specific strategies in their recruitment process. However, the desirable expertise can vary case by case. For example, with science news, it might be desirable to have crowd ratings closer to a science expert’s understanding of the subject matter (Credible Evidence/Grounding), while for breaking information, news consumers may appreciate ratings that reflect journalistic expertise in verifying source quality (Publication Reputation). A greater understanding of the expertise desirable for different news stories would further help future design.
Analysis of the article types suggests that some articles may require subject matter expertise while others may be reliably assessed by the crowd. For article types where crowds have high disagreement with experts, alternate approaches can be devised, including training focused on particular flaws or a very tailored set of questions. Our study focused on a topic area in which U.S. Democrats have been shown to have a stronger relationship to credible information. A closer examination of other topics, with different kinds of polarization and expertise, is needed. Vaccine hesitancy, for instance, is an issue with traction across the political spectrum, as are a number of conspiracy theories, with the former attitude now represented in general crowds in contrast to the continued fringe nature of the latter [34, 70]. The relationship between credibility and a strong partisan perspective may not exist in these cases, but we may find other factors of belief, such as extremism, in addition to relevant expert criteria that can frame tasks better. Another consideration in task suitability is task difficulty where even the expert groups diverge. In these cases, policy decisions may need to be made regarding which expertise is more relevant before designing crowd tasks.

Overall, a successful crowdsourcing approach requires that tasks be designed carefully with specific crowds, content, and experts all in mind. In Section 5.1, we propose an approach that focuses on signals as an example. We note that this approach requires us to find signals that not only align with expert judgments but are also possible for crowds to locate and assess in a reliable and consistent manner.
Our analysis suffers from several limitations. First, the results are limited by our dataset, which is derived from only a popular set of articles on a particular topic of science news, an area in which domain experts largely agree. Making general claims from such limited data would be inaccurate, as we have explained throughout, but we can infer that the corresponding relationship among experts and raters might be at least as complicated as we found, if not more so. Second, annotations from our recruited populations of both crowd and expert groups are also limited in quantity and in demographic/expertise background. In particular, differences between our two expert groups could have been due to our restricted sample. Third, some of the expert results are drawn from an incomplete set of expert criteria, and our codebook for expert criteria is constructed from the authors’ interpretation; thus, conclusions there also have their limitations.
In this work, using the domain of climate news, we dive into the notion of crowdsourcing credibility through a series of analyses of its main components: the makeup of the crowd, the scope of tasks that the crowd is assigned, and the subject area expert criteria in question. In particular, we explore characteristics of the “crowd,” in terms of traits such as background, demographics, and political leaning, and whether they have a bearing on task performance. We show this in a comparison between ratings made by students and others recruited through journalism networks versus crowd workers on Upwork. We also interrogate the nature of the crowdsourcing task itself, finding that the genre of the article and the partisanship of the publication have different relationships to both crowds and experts. This led us to better understand the reasoning of the experts themselves. In our case, we looked at how experts in journalism versus experts in science have different ways of assessing article credibility, based on factors such as Credible Evidence/Grounding and Publication Reputation. Disagreement among raters is neither always bad nor always about their capacities; it is at times about the suitability of the task [2] and about the particular subject area expertise in question as well. By investigating the variability introduced by all these components, we point towards how the design of crowd assessments to approximate expert-level credibility can be made more robust.
This paper would not be possible without the valuable support of the Credibility Coalition, with special thanks to Caio Almeida, An Xiao Mina, Jennifer 8. Lee, Rick Weiss, Kara Laney, and especially Dwight Knell. Bhuiyan and Mitra were partly supported through a National Science Foundation grant.
REFERENCES
[1] William RL Anderegg, James W Prall, Jacob Harold, and Stephen H Schneider. 2010. Expert credibility in climate change. Proceedings of the National Academy of Sciences.
[2] AI Magazine 36, 1 (2015), 15–24.
[3] Mahmoudreza Babaei, Abhijnan Chakraborty, Juhi Kulshrestha, Elissa M Redmiles, Meeyoung Cha, and Krishna P Gummadi. 2019. Analyzing Biases in Perception of Truth in News Stories and Their Implications for Fact Checking. In FAT. 139.
[4] Mevan Babakar. 2018. Crowdsourced Factchecking.
[5] Betsy Jane Becker. 1994. Combining significance levels. The handbook of research synthesis (1994), 215–230.
[6] Joshua Becker, Ethan Porter, and Damon Centola. 2019. The wisdom of partisan crowds. Proceedings of the National Academy of Sciences of the United States of America.
[7] The Guardian.
[8] Nature communications 10, 1 (2019), 7.
[9] Mohamad Adam Bujang and Nurakmal Baharum. 2016. Sample size guideline for correlation analysis. World 3, 1 (2016).
[10] Cody Buntain and Jennifer Golbeck. 2017. Automatically Identifying Fake News in Popular Twitter Threads. Proceedings - 2nd IEEE International Conference on Smart Cloud, SmartCloud 2017 (2017), 208–215. https://doi.org/10.1109/SmartCloud.2017.40 arXiv:1705.01613
[11] Davide Ceolin. 2019. Conference Presentation: On the Quality of Crowdsourced Information Quality Assessments. https://drive.google.com/a/hackshackers.com/file/d/1AJmFmRqEhdhSIZLwhXT_1bzStXfV-hVf/view?usp=drive_open&usp=embed_facebook
[12] Steven H Chaffee. 1982. Mass media and interpersonal channels: Competitive, convergent, or complementary. Inter/media: Interpersonal communication in a media world 57 (1982), 77.
[13] Shelly Chaiken. 1987. The heuristic model of persuasion. In Social influence: the ontario symposium, Vol. 5. Hillsdale, NJ: Lawrence Erlbaum, 3–39.
[14] Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. 2000. The mahalanobis distance. Chemometrics and intelligent laboratory systems 50, 1 (2000), 1–18.
[15] Jaap J Dijkstra, Wim BG Liebrand, and Ellen Timminga. 1998. Persuasiveness of expert systems. Behaviour & Information Technology 17, 3 (1998), 155–163.
[16] Ziv Epstein, Gordon Pennycook, and David Rand. 2020. Will the Crowd Game the Algorithm? Using Layperson Judgments to Combat Misinformation on Social Media by Downranking Distrusted Sources. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI '20). Association for Computing Machinery, 1–11. https://doi.org/10.1145/3313831.3376232
[17] Jonathan St BT Evans. 2008. Dual-processing accounts of reasoning, judgment, and social cognition. Annu. Rev. Psychol.
[20] Digital media, youth, and credibility (2008), 5–27.
[21] Fabrice Florin. 2010. Crowdsourced Fact-Checking? What We Learned from Truthsquad. Mediashift (2010).
[22] Brian J Fogg. 2003. Prominence-interpretation theory: Explaining how people assess credibility online. In CHI'03 extended abstracts on human factors in computing systems. Citeseer, 722–723.
[23] Brian J Fogg and Hsiang Tseng. 1999. The elements of computer credibility. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems. 80–87.
[24] American Press Institute & The AP-NORC Center for Public Affairs Research. 2018. Americans and the news media: What they do–and don't–understand about each other. The Media Insight Project (2018).
[25] Cary Funk, Meg Hefferon, Brian Kennedy, and Courtney Johnson. 2019. Trust and Mistrust in Americans' Views of Scientific Experts. (2019).
[26] Cecilie Gaziano and Kristin McGrath. 1986. Measuring the concept of credibility. Journalism quarterly 63, 3 (1986), 451–462.
[27] Emma Grillo. 2020. What Does a Sports Desk Do When Sports Are on Hold? The New York Times.
Science.
Information Processing & Management 44, 4 (2008), 1467–1484.
[31] Benjamin D. Horne and Sibel Adali. 2017. This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News. (2017), 759–766. arXiv:1703.09398 http://arxiv.org/abs/1703.09398
[32] Carl Iver Hovland, Irving Lester Janis, and Harold H Kelley. 1953. Communication and persuasion. (1953).
[33] Rebecca Iannucci and Bill Adair. 2017. Reporters' Lab Study Results: Effective News Labeling and Media Literacy.
[34] Jonathan Kennedy. 2019. Populist politics and vaccine hesitancy in Western Europe: an analysis of national-level data. European Journal of Public Health 29, 3 (Jun 2019), 512–516. https://doi.org/10.1093/eurpub/ckz004
[35] Gary King and Richard Nielsen. 2019. Why propensity scores should not be used for matching. Political Analysis 27, 4 (2019), 435–454.
[36] Spiro Kiousis. 2001. Public trust or mistrust? Perceptions of media credibility in the information age. Mass communication & society 4, 4 (2001), 381–403.
[37] Aniket Kittur, Boris Smus, Susheel Khamkar, and Robert E Kraut. 2011. Crowdforge: Crowdsourcing complex work. In Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 43–52.
[38] Michael Lucibella. 2009. Science Journalism Faces Perilous Times. American Physical Society (APS) News.
[39] Journal of personality and social psychology (2014).
[40] Albert E. Mannes, Richard P. Larrick, and Jack B. Soll. 2012. The social psychology of the wisdom of crowds.
[41] Aaron M. McCright, Katherine Dentzman, Meghan Charters, and Thomas Dietz. 2013. The influence of political ideology on trust in science. Environmental Research Letters 8, 4 (Nov 2013), 044029.
[42] Miriam J Metzger. 2007. Making sense of credibility on the Web: Models for evaluating online information and recommendations for future research. Journal of the American Society for Information Science and Technology 58, 13 (2007), 2078–2091.
[43] Miriam J Metzger, Ethan H Hartsell, and Andrew J Flanagin. 2015. Cognitive dissonance or credibility? A comparison of two theoretical explanations for selective exposure to partisan news. Communication Research (2015), 0093650215613136.
[44] Philip Meyer. 1988. Defining and measuring credibility of newspapers: Developing an index. Journalism quarterly.
[45] Can Americans Tell Factual From Opinion Statements in the News?
[46] Tanushree Mitra and Eric Gilbert. 2015. CREDBANK: A Large-Scale Social Media Corpus with Associated Credibility Annotations. In Proc. ICWSM'15.
[47] Tanushree Mitra, Clayton J Hutto, and Eric Gilbert. 2015. Comparing person- and process-centric strategies for obtaining quality data on amazon mechanical turk. In Proc. CHI'15. ACM, 1345–1354.
[48] Kevin Munger, Mario Luca, Jonathan Nagler, and Joshua Tucker. 2019. Age matters: Sampling strategies for studying digital media effects.
[49] American Society of Newspaper Editors. 1975. ASNE Statement of Principles. https://members.newsleaders.org/content.asp?pl=24&sl=171&contentid=171. (Accessed on 01/14/2020).
[50] Daniel J O'Keefe. 2008. Persuasion. The International Encyclopedia of Communication (2008).
[51] Sheila O'Riordan, Gaye Kiely, Bill Emerson, and Joseph Feller. 2019. Do you have a source for that? Understanding the Challenges of Collaborative Evidence-based Journalism. In Proceedings of the 15th International Symposium on Open Collaboration. 1–10.
[52] Gordon Pennycook, Tyrone Cannon, and David G. Rand. 2018. Prior Exposure Increases Perceived Accuracy of Fake News. Number ID 2958246. https://papers.ssrn.com/abstract=2958246
[53] Gordon Pennycook and David G. Rand. 2019. Fighting misinformation on social media using crowdsourced judgments of news source quality. Proceedings of the National Academy of Sciences.
[54] Who Falls for Fake News? The Roles of Bullshit Receptivity, Overclaiming, Familiarity, and Analytic Thinking. Number ID 3023545. https://papers.ssrn.com/abstract=3023545
[55] Richard E Petty and John T Cacioppo. 1986. The elaboration likelihood model of persuasion. In Communication and persuasion. Springer, 1–24.
[56] The Trust Project. 2017. Collaborator Materials. https://thetrustproject.org/collaborator-materials/. (Accessed on 01/14/2020).
[57] Soo Young Rieh and David R Danielson. 2007. Credibility: A multidisciplinary framework. Annual review of information science and technology 41, 1 (2007), 307–364.
[58] Robert M. Ross, David G. Rand, and Gordon Pennycook. 2019. Beyond "fake news": The role of analytic thinking in the detection of inaccuracy and partisan bias in news headlines. (2019), 1–22.
[59] Linda Schamber. 1991. Users' Criteria for Evaluation in a Multimedia Environment. In Proceedings of the ASIS Annual Meeting, Vol. 28. ERIC, 126–33.
[60] Dietram A Scheufele and Nicole M Krause. 2019. Science audiences, misinformation, and fake news. Proceedings of the National Academy of Sciences.
[61] Proceedings of the 10th ACM Conference on Web Science (WebSci '19). ACM, 287–288. https://doi.org/10.1145/3292522.3326055. Boston, Massachusetts, USA.
[62] Art Silverblatt, Donald C. Miller, Julie Smith, and Nikole Brown. 2014. Media Literacy: Keys to Interpreting Media Messages, 4th Edition. ABC-CLIO.
[63] Henry Silverman. 2019. Helping Fact-Checkers Identify False Claims Faster - About Facebook. https://about.fb.com/news/2019/12/helping-fact-checkers/. (Accessed on 01/10/2020).
[64] Julianne Stanford, Ellen R Tauber, BJ Fogg, and Leslie Marable. 2002. Experts vs. online consumers: A comparative credibility study of health and finance Web sites. Consumer Web Watch.
[65] Anselm Strauss and Juliet Corbin. 1994. Grounded theory methodology. Handbook of qualitative research 17 (1994), 273–85.
[66] S Shyam Sundar. 1999. Exploring receivers' criteria for perception of print and online news. Journalism & Mass Communication Quarterly 76, 2 (1999), 373–386.
[67] S Shyam Sundar. 2008. The MAIN model: A heuristic approach to understanding technology effects on credibility. Digital media, youth, and credibility.
[68] Harvard Business Review (Sep 2006). https://hbr.org/2006/09/when-crowds-arent-wise
[69] James Surowiecki. 2004. The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Doubleday.
[70] Jan-Willem van Prooijen, André P. M. Krouwel, and Thomas V. Pollet. 2015. Political Extremism Predicts Belief in Conspiracy Theories. Social Psychological and Personality Science 6, 5 (Jul 2015), 570–578. https://doi.org/10.1177/1948550614567356
[71] Christian Wagner and Ayoung Suh. 2014. The Wisdom of Crowds: Impact of Collective Size and Expertise Transfer on Collective Performance. (Jan 2014), 594–603. https://doi.org/10.1109/HICSS.2014.80
[72] Lorraine Whitmarsh. 2011. Scepticism and uncertainty about climate change: Dimensions, determinants and change over time. Global Environmental Change 21, 2 (May 2011), 690–700.
[73] Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, Nada Hashmi, and Thomas W. Malone. 2010. Evidence for a Collective Intelligence Factor in the Performance of Human Groups. Science.
[74] Companion Proceedings of The Web Conference 2018. 603–612.
Fig. 11. Correlation of credibility ratings on the matched data. The lines show correlation among all pairs in four groups: 2 crowd and 2 expert groups (Student vs Expert[science], Upwork vs Expert[science], Student vs Expert[journalism], Upwork vs Expert[journalism], Student vs Upwork, Expert[journalism] vs Expert[science]). In each crowd group, we sample the number of raters from 1–25. For expert groups, we take all 3 ratings. Then we compute the Spearman ρ between the mean responses from each group on all 50 articles. The plot shows the average ρ after 100 resamplings. [x-axis: no. of users sampled from crowd group; y-axis: correlation of credibility.]

A RQ1: COMPARING UPWORK AND STUDENT CROWD RATERS TO EXPERTS WHEN CROWD GROUPS ARE CONTROLLED ON DEMOGRAPHY
The difference between the two crowd groups in the RQ1 analysis could have been a result of underlying issues such as demographic variance. To account for this, we also performed the RQ1 analysis while controlling for four demographic factors: Gender, Age, Education, and Political Alignment. For this purpose, we created matched participants between Upwork workers and Students using the Match function from the R package Matching (https://sekhon.berkeley.edu/matching/Match.html). Because we had a smaller number of users in the Upwork group, we matched them against Students, with ties handled randomly. Due to duplicates, this method retained 21 unique students out of 49. We used Mahalanobis distance as the matching criterion instead of a propensity score, because recent work suggests that propensity score matching increases imbalance rather than decreasing it [14, 35]. Figure 11 shows the correlation between and within crowd and expert groups on the matched data.
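The matching itself was done with the Match function from the R Matching package (Mahalanobis distance, ties broken randomly, duplicate matches later collapsed to unique students); a rough Python analogue of that step, under those assumptions and with hypothetical variable names, might look like the following sketch rather than a faithful reimplementation of Match().

```python
import numpy as np

def mahalanobis_match(upwork_X, student_X, seed=0):
    """For each Upwork worker, pick the closest Student by Mahalanobis distance
    over the coded covariates (Gender, Age, Education, Political Alignment).
    Matching is with replacement, so the same student can be chosen repeatedly;
    the unique set of matched students is what gets retained."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([upwork_X, student_X])
    VI = np.linalg.pinv(np.cov(pooled, rowvar=False))    # inverse covariance
    matched = []
    for x in upwork_X:
        diffs = student_X - x
        d2 = np.einsum("ij,jk,ik->i", diffs, VI, diffs)  # squared distances
        ties = np.flatnonzero(d2 == d2.min())
        matched.append(int(rng.choice(ties)))            # break ties randomly
    return sorted(set(matched))                          # unique matched students
```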
B NEWS ARTICLE DISTRIBUTION

Table 7. Article distribution from the sources (by website).

C SAMPLE OF EXPERT NOTES & QUALITATIVE CODES
Note: “A neutral discussion about the fight between left and right wing partisan on US President (lack of) role in the hurricane Florence disaster.”
Codes: (Impartiality) neutral, nonpartisan tone/lack of attacks or injected opinion [+]

Note: “This story fails to include comments from independent scientists in the field or to provide necessary context for readers. For example, the study fails to account for more recent volcanic activity, and does not support its conclusion that climate models are overly sensitive to CO2. In addition, the story’s headline emphasizes that the study shows ‘no acceleration in global warming for 23 years’ and this is presented as a challenge to model simulations. This is misleading, as no acceleration of the warming rate is expected to be seen in such a short timeframe. https://climatefeedback.org/evaluation/daily-caller-uncritically-reports-misleading-satellite-temperature-study-michael-bastasch/”
Codes: (Credible Evidence/Grounding) lack of quotes from experts [-]; (Accuracy) misleading headline [-]; (Originality and Insight) poor interpretation/uninformed implications [-]; (Completeness of Coverage) lack of context [-]; (Completeness of Coverage) light/cursory coverage [-]

Table 8. Sample notes from the experts and corresponding codes (high-level categories inside brackets).
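To connect coded notes like these back to the regression in Table 6, each rating’s codes can be collapsed into the binary positive/negative indicators used as regressors; below is a small illustrative sketch in which the example rows echo Table 8 and the column names are our own assumptions.

```python
import pandas as pd

# One row per (rating_id, category, valence) code assignment, as in Table 8.
codes = pd.DataFrame({
    "rating_id": [1, 2, 2, 2, 2, 2],
    "category": ["Impartiality", "Credible Evidence/Grounding", "Accuracy",
                 "Originality and Insight", "Completeness of Coverage",
                 "Completeness of Coverage"],
    "valence": ["+", "-", "-", "-", "-", "-"],
})

# One binary column per (category, valence), e.g. "Accuracy[-]".
codes["factor"] = codes["category"] + "[" + codes["valence"] + "]"
indicators = pd.crosstab(codes["rating_id"], codes["factor"]).clip(upper=1)
print(indicators)   # presence/absence regressors, one row per rating
```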
D DEFINING SIGNALS
Clickbait.
We provided the following categories as examples of clickbait.
• Listicle (“6 Tips on ...”)
• Cliffhanger to a story (“You Won’t Believe What Happens Next”)
• Provoking emotions, such as shock or surprise (“...Shocking Result”, “...Leave You in Tears”)
• Hidden secret or trick (“Fitness Companies Hate Him...”, “Experts are Dying to Know Their Secret”)
• Challenges to the ego (“Only People with IQ Above 160 Can Solve This”)
• Defying convention (“Think Orange Juice is Good for you? Think Again!”, “Here are 5 Foods You Never Thought Would Kill You”)
• Inducing fear (“Is Your Boyfriend Cheating on You?”)
Representativeness.
We suggested the following categories for how an article can be unrepresentative.
• Title is on a different topic than the body
• Title emphasizes different information than the body
• Title carries little information about the body
• Title takes a different stance than the body
• Title overstates claims or conclusions in the body
• Title understates claims or conclusions in the body