How Fake News Affect Trust in the Output of a Machine Learning System for News Curation
Hendrik Heuer and Andreas Breiter
University of Bremen, Institute for Information Management, Germany
{hheuer,abreiter}@ifib.de

Abstract.
People are increasingly consuming news curated by machine learning (ML) systems. Motivated by studies on algorithmic bias, this paper explores which recommendations of an algorithmic news curation system users trust and how this trust is affected by untrustworthy news stories like fake news. In a study with 82 vocational school students with a background in IT, we found that users are able to provide trust ratings that distinguish trustworthy recommendations of quality news stories from untrustworthy recommendations. However, a single untrustworthy news story combined with four trustworthy news stories is rated similarly to five trustworthy news stories. The results could be a first indication that untrustworthy news stories benefit from appearing in a trustworthy context. The results also show the limitations of users' abilities to rate the recommendations of a news curation system. We discuss the implications of this for the user experience of interactive machine learning systems.
Keywords:
Human-Centered Machine Learning, Algorithmic Experience, Algorithmic Bias, Fake News, Social Media
1 Introduction

News curation is the complex activity of selecting and prioritizing information based on some criteria of relevance and with regard to limitations of time and space. While traditionally the domain of the editorial offices of newspapers and other media outlets, this curation is increasingly performed by machine learning (ML) systems that rank the relevance of content [3,18]. This means that complex, intransparent ML systems influence the news consumption of billions of users. Pew Research Center found that around half of U.S. adults who use Facebook (53%) think they do not understand why certain posts are included in their news feeds [2]. This motivates us to explore how users perceive news recommendations and whether users can distinguish trustworthy from untrustworthy ML recommendations. We also examine whether untrustworthy news stories like fake news benefit from a trustworthy context, for instance, when an ML system predicts five stories, where four are trustworthy news stories and one is a fake news story. We operationalized the term fake news as "fabricated information that mimics news media content in form but not in organizational process or intent" [29].
Investigating trust and fake news in the context of algorithmic news curation is important since such algorithms are an integral part of social media platforms like Facebook, which are a key vector of fake news distribution [3]. Investigations of trust in news and people's propensity to believe in rumors have a long history [5,4]. We focus on trust in a news recommender system, which connects to O'Donovan et al. and Massa and Bhattacharjee [36,42]. Unlike them, our focus is not the trust in the individual items, but the trust in the ML system and its recommendations. The design of the study is shaped by how users interact with machine learning systems. Participants rate their trust in the recommendations of a machine learning system, i.e. they rate groups of news stories. Participants were told that they are interacting with an ML system, i.e. that they are not simply rating the content. We focus on trust because falling for fake news is not simply a mistake. Fake news are designed to mislead people by mimicking news media content. Our setting connects to human-in-the-loop and active machine learning, where users are interacting with a live system that they improve with their actions [53,7,27]. In such settings, improving a news curation algorithm by rating individual items would require a lot of time and effort from users. We, therefore, explore explicitly rating ML recommendations as a whole as a way to gather feedback.

An investigation of how ML systems and their recommendations are perceived by users is important for those who apply algorithmic news curation and those who want to enable users to detect algorithmic bias in use. This is relevant for all human-computer interaction designers who want to enable users to interact with machine learning systems. This investigation is also relevant for ML practitioners who want to collect feedback from users on the quality of their systems or practitioners who want to crowdsource the collection of training data for their machine learning models [21,59,54].

In our experiment, participants interacted with a simple algorithmic news curation system that presented them with news recommendations similar to a collaborative filtering system [50,23]. We conducted a between-subjects study with two phases. Our participants were recruited in a vocational school. They all had a technical background and were briefed on the type of errors that ML systems can make at unexpected times. In the first phase, participants rated their trust in different news stories. This generated a pool of news stories with trust ratings from our participants. Participants rated different subsets of news stories, i.e. each of the news stories in our investigation was rated by some users while others did not see it. In the second phase, the algorithmic news curation system combined unseen news stories for each user based on each news story's median trust rating. This means that the trust rating of a story is based on the intersubjective agreement of the participants who rated it in the first phase. This allowed us to investigate how the trust in individual stories influences the trust in groups of news stories predicted by an ML system. We varied the number of trustworthy and untrustworthy news stories in the recommendations to study their influence on the trust rating on an 11-point rating scale. Our main goal is to understand the trust ratings of ML output as a function of the trust in the individual news items.
In summary, this paper answers the following three research questions:

– Can users provide trust ratings for news recommendations of a machine learning system (RQ1)?
– Do users distinguish trustworthy ML recommendations from untrustworthy ML recommendations (RQ2)?
– Do users distinguish trustworthy ML recommendations from recommendations that include one individual untrustworthy news story (RQ3)?

We found that users are able to give nuanced ratings of machine learning recommendations. In their trust ratings, they distinguish trustworthy from untrustworthy ML recommendations if all stories in the output are trustworthy or if all are untrustworthy. However, participants are not able to distinguish trustworthy news recommendations from recommendations that include one fake news story, even though they can distinguish other ML recommendations from trustworthy recommendations.
2 Related Work

The goal of news recommendation and algorithmic news curation systems is to model users' interests and to recommend relevant news stories. An early example of this is GroupLens, a collaborative filtering architecture for news [50]. The prevalence of opaque and invisible algorithms that curate and recommend news motivated a variety of investigations of user awareness of algorithmic curation [22,18,47,19]. A widely used example of such a machine learning system is Facebook's News Feed. Introduced in 2006, Facebook describes the News Feed as a "personalized, ever-changing collection of posts from the friends, family, businesses, public figures and news sources you've connected to on Facebook" [20]. By their own account, the three main signals that they use to estimate the relevance of a post are: who posted it, the type of content, and the interactions with the post. In this investigation, we primarily focus on news and fake news on social media and the impact of the machine learning system on news curation.

Alvarado and Waern coined the term algorithmic experience as an analytic framing for making the interaction with and experience of algorithms explicit [6]. Following their framework, we investigate the algorithmic experience of users of a news curation algorithm. This connects to Schou and Farkas, who investigated algorithmic news curation and the epistemological challenges of Facebook [55]. They address the role of algorithms in pre-selecting what appears as representable information, which connects to our research question whether users can detect fake news stories.

This paper extends prior work on algorithmic bias. Eslami et al. showed that users can detect algorithmic bias during their regular usage of online hotel rating platforms and that this affects trust in the platform [19]. Our investigation is focused on trust as an important expression of users' beliefs.
This connects to Rader et al., who explored how different ways of explaining the outputs of an algorithmic news curation system affect users' beliefs and judgments [46]. While explanations make people more aware of how a system works, they are less effective in helping people evaluate the correctness of a system's output.

The Oxford dictionary defines trust as the firm belief in the reliability, truth, or ability of someone or something [15]. Due to the diverse interest in trust, there are many different definitions and angles of inquiry. They range from trust as an attitude or expectation [51,49], to trust as an intention or willingness to act [37], to trust as a result of behaviour [14]. Trust was explored in a variety of different contexts, including, but not limited to, intelligent systems [57,23], automation [40,39,30], organisations [37], oneself [35], and others [51]. Lee and See define trust as an attitude of an agent with a goal in a situation that is characterized by some level of uncertainty and vulnerability [30]. The sociologist Niklas Luhmann defined trust as a way to cope with risk, complexity, and a lack of system understanding [31]. For Luhmann, trust is what allows people to face the complexity of the world. Other trust definitions cite a positive expectation of behavior and reliability [52,51,40].

Our research questions connect to Cramer et al., who investigated trust in the context of spam filters, and Berkovsky et al., who investigated trust in movie recommender systems [10,13]. Cramer et al. found that trust guides reliance when the complexity of an automation makes a complete understanding impractical. Berkovsky et al. argue that system designers should consider grouping the recommended items using salient domain features to increase user trust, which supports earlier findings by Pu and Chen [45]. In the context of online behavioral advertising, Eslami et al. explored how to communicate algorithmic processes by showing users why an ad is shown to them [17]. They found that users prefer interpretable, non-creepy explanations.

Trust ratings are central to our investigation. We use them to measure whether participants distinguish trustworthy from untrustworthy machine learning recommendations and investigate the influence of outliers. A large number of publications used trust ratings as a way to assess trust [40,39,33,44]. In the context of online news, Pennycook and Rand showed that users can rate trust in news sources and that they can distinguish mainstream media outlets from hyperpartisan or fake news sources [44]. Muir modeled trust in a machine based on interpersonal trust and showed that users can meaningfully rate their trust [40]. In the context of a pasteurization plant simulation, Muir and Moray showed that operators' subjective ratings of trust provide a simple, nonintrusive insight into their use of the automation [39]. Regarding the validity of such ratings, Cosley et al. showed that users of recommender system interfaces rate fairly consistently across rating scales and that they can detect systems that manipulate outputs [12].
3 Methodology

To explore trust in the context of algorithmic news curation, we conducted an experiment with 82 participants from a vocational school with a focus on IT. In the first phase of the study, participants with a technical background rated individual news stories, one at a time. In the second phase of the study, participants rated ML recommendations, i.e. five news stories that were presented together as the recommendations of an algorithmic news curation system. The study was conducted in situ via a web application that presented the two phases.

We recruited a homogeneous group of participants in a German vocational school. To prevent a language barrier from adding bias, the experiment was conducted in German. In Germany, the performance of students is strongly dependent on socio-economic factors [43]. Students of a vocational school, which starts after compulsory schooling, have a similar background. This allows us to control for age, educational background, and socio-economic background. The mean age of the 82 participants was 21.40 (SD=3.92). The school had a strong STEM focus: all of the six classes were trained in IT (but they had no formal training in machine learning). The IT focus of the vocational school introduced a gender bias: 73 participants identified as male, 5 as female, 2 chose not to disclose their gender, and 2 identified as a non-binary gender. This gender bias is representative of a vocational school with a STEM focus in Germany. In the training year 2016, women accounted for only 7.9% of new IT trainees in Germany [1].

Like Muir and Cramer et al., we adopt Luhmann's definition of trust as a way to cope with risk, complexity, and a lack of system understanding [31,40,13]. Our operationalization focuses on interpersonal and social trust, which can be described as the generalized expectancy that a person can rely on the words or promises of others [51]. When consuming news, a person is making herself or himself reliant on a highly complex system that involves journalists, publishers, and interviewees. When interacting with an algorithmic news curation system, a person is making herself or himself reliant on a highly complex socio-technical system, which cannot be understood entirely and which can malfunction for myriad reasons. Each part of the system poses a risk, either due to mistakes, misunderstandings, or malicious intent. A social media platform that performs algorithmic news curation includes actors like the platform provider, the advertisers, other users, and all the different news sources with different levels of trustworthiness. All add complexity and risk. Understanding and auditing how this socio-technical system works is neither possible nor practical.

Before the experiment, we explained the rating interface, provided Mitchell's definition of ML, and briefly mentioned ML applications like object detection and self-driving cars. According to Mitchell, "a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" [38]. To illustrate this, we showed participants how an ML algorithm learns to recognize hand-written digits. This was meant to show how and why some digits are inevitably misclassified. Algorithmic news curation was introduced as another machine learning application.
The term fake news was illustrated using examples like Pope Francis backing Trump and the German Green party banning meat.
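The hand-written digit demonstration can be reproduced in a few lines. The sketch below is not the material shown to participants but a minimal reconstruction of this kind of briefing demo; it assumes scikit-learn is available and uses its bundled digits dataset to make the inevitable misclassifications visible.

```python
# Minimal sketch of a digit-recognition briefing demo (assumes scikit-learn).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 grayscale images of hand-written digits
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# The confusion matrix makes the misclassifications visible: off-diagonal
# entries are digits the model confused with one another.
print(confusion_matrix(y_test, model.predict(X_test)))
```

The off-diagonal entries of the printed confusion matrix illustrate the kind of errors at unexpected times that the briefing referred to.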
The task in the first phase was to provide trust ratings for news stories from different sources. In this phase, participants evaluated each piece of content individually. As news stories, we used two days of publicly available Facebook posts of 13 different sources. The study was conducted in May 2017, i.e. before the Cambridge Analytica scandal and before the Russian interference in the 2016 United States elections became publicly known.

We distinguish between seven quality media sources, e.g. public-service broadcasters and newspapers of record, and six biased sources, including tabloid media and fake news blogs. The quality media sources and the tabloid sources were selected based on their reach as measured by Facebook likes. Fake news sources were selected based on mentions in news articles on German fake news [9]. Tabloid newspapers are characterized by a sensationalistic writing style and limited reliability. But, unlike fake news, they are not fabricated or intentionally misleading. For our experiment, a weighted random sample of news stories was selected from all available posts. Each of the 82 participants rated 20 news stories from a weighted random sample consisting of eight quality media news stories, four tabloid news stories, and eight fake news stories. The weighted sample accounted for the focus on fake news and online misinformation. The selected stories cover a broad range of topics, including sports like soccer, social issues like homelessness and refugees, and stories on politicians from Germany, France, and the U.S.

The presentation of the news stories resembled Facebook's official visual design. For each news story, participants saw the headline, lead paragraph, lead image, the name of the source, source logo, source URL, date and time, as well as the number of likes, comments, and shares of the Facebook post. Participants were not able to click on links or read the entire article. The data was not personalized, i.e. all participants saw the same number of likes, shares, and comments that anybody without a Facebook account would have seen when visiting the Facebook Page of the news source. In the experiment, participants rated news stories on an 11-point rating scale. The question they were asked for each news story was: "Generally speaking, would you say that this news story can be trusted, or that you can't be too careful? Please tell me on a score of 0 to 10, where 0 means you can't be too careful and 10 means that this news story can be trusted". Range and phrasing of the question are modeled after the first question of the Social Trust Scale (STS) of the European Social Survey (ESS), which is aimed at interpersonal trust and connected to the risk of trusting a person respectively a news story [48].

After the experiment, the ratings of the news stories from Phase 1 were validated with media research experts. Each media researcher ranked the news sources by how trustworthy they considered the source. These rankings were compared to the median trust ratings of the news sources by the users. The experts were recruited from two German labs with a focus on media research on public communication and other cultural and social domains. All members of the two labs were contacted through internal newsletters. In a self-selection sample, nine media researchers (three male, six female) provided their ranking via e-mail (two from lab A, seven from lab B).
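The weighted sampling of the 20-story set can be sketched as follows. The post pools and names below are hypothetical placeholders, not the study's data, and the shuffled presentation order is an assumption.

```python
import random

# Hypothetical placeholder pools of Facebook posts, pre-labeled by category.
quality_posts = [f"quality_story_{i}" for i in range(40)]
tabloid_posts = [f"tabloid_story_{i}" for i in range(20)]
fake_posts = [f"fake_story_{i}" for i in range(40)]

def draw_sample():
    """Draw the 20-story set each participant rated: eight quality media,
    four tabloid, and eight fake news stories."""
    sample = (random.sample(quality_posts, 8)
              + random.sample(tabloid_posts, 4)
              + random.sample(fake_posts, 8))
    random.shuffle(sample)  # presentation order is an assumption here
    return sample

print(draw_sample())
```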
In the second phase, participants rated their trust in the output of a news curation system. The task was not to identify individual fake news items. Participants rated the ML recommendations as a group selected by an ML system. In the study, the output of the ML system always consisted of five unseen news stories. We selected the unseen news stories based on their median trust ratings from Phase 1. The median is used as a robust measure of central tendency [25], which captures intersubjective agreement and which limits the influence of individual outliers. We adapted our approach from collaborative filtering systems like GroupLens [50,23]. Collaborative filtering systems identify users with similar rating patterns and use these similar users to predict unseen items. Since our sample size was limited, we could not train a state-of-the-art collaborative filtering system. Therefore, we used the median trust rating as a proxy.

Our goal was to understand how the presence of fake news changes the feedback users give for a machine learning system and whether trust ratings account for the presence of fake news. Our motivation was to explore how fine-grained the user feedback on a system's performance is. This is important for fields like active learning or interactive and mixed-initiative machine learning [24,8,26,56], where user feedback is used to improve the system. While the experiment briefing made people believe that they were interacting with a personalized ML system, the recommendations were not actually personalized. We did this to be able to compare the ratings. Unlike in Wizard of Oz experiments, there was no experimenter in the loop. Users freely interacted with an interactive software system that learned from examples.
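A minimal sketch of this median proxy, assuming Phase 1 ratings are stored as lists of 0-10 scores per story (the story identifiers and values are illustrative, not the study's data):

```python
from statistics import median

# Hypothetical Phase 1 ratings: story id -> trust ratings (0-10) from the
# subset of participants who saw that story.
phase1_ratings = {
    "story_a": [5, 6, 4, 5],
    "story_b": [1, 2, 0, 3],
    "story_c": [8, 7, 9],
}

# The median trust rating of a story serves as a proxy for the
# intersubjective agreement of the participants who rated it.
median_trust = {story: median(r) for story, r in phase1_ratings.items()}
print(median_trust)  # {'story_a': 5.0, 'story_b': 1.5, 'story_c': 8}
```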
To investigate how the trust ratings of the recommendations change based on the trustworthiness of the individual news stories, we combine five news stories in random order with different levels of trustworthiness. The scale ranges from "can't be too careful (0)" to "can be trusted (10)". We refer to the trustworthiness of a news story as low (if the trust rating is between 0 and 3), medium (4 to 6), and high (7 to 10); a short sketch of this bucketing follows Figure 1.

Figure 1 shows the four types of news recommendations that we discuss in this paper as well as the rating interface:

– a) Medium: ML output that consists of five news stories with median trust ratings between 4 and 6.
– b) Medium, 1 Low: ML output with four news stories with ratings between 4 and 6 and one with a rating between 0 and 3 (shown in Figure 1).
– c) Medium, 1 Low, 1 High: ML output that consists of three medium news stories, one with a low trust rating, and one with a high rating between 7 and 10.
– d) Low: ML output where all news stories have a trust rating between 0 and 3.
[Figure 1: the four types of news recommendations a) to d), each combining five news stories labeled LOW, MEDIUM, or HIGH, with brackets marking quality media and fake news stories; next to them, the rating interface showing the German trust question "Glauben Sie, dass man dieser Nachrichtensammlung vertrauen kann, oder dass man in diesem Fall nicht vorsichtig genug sein kann?" ("Do you believe that this collection of news stories can be trusted, or that in this case you can't be too careful?") and a post with 114 likes, 61 comments, and 171 shares.]
Fig. 1. For Phase 2, different types of ML recommendations were generated by combining five news stories from Phase 1 by their median trust rating. Participants rated the trustworthiness of these collections of unseen news stories with a single score on an 11-point rating scale.

Our goal was to show as many different combinations of news recommendations to participants as possible. Unfortunately, what we were able to test depended on the news ratings in the first phase. Here, only a small subset of participants gave high ratings. This means that news recommendations like High, 1 Low, as well as Low, 1 High, could not be investigated. Figure 1 shows the different types of ML recommendations that were presented to more than ten participants. In the figure, the five stories are shown in a collapsed view. In the experiment, participants saw each news story in its full size, i.e. the texts, images, and the number of shares, likes, and comments were fully visible for each of the five news stories in the news recommendation. The news stories were presented in a web browser where participants were able to scroll. Participants rated the news recommendation on the same 11-point rating scale as the individual news items, where 0 was defined as "you can't be too careful" and 10 as "this collection of news stories can be trusted".
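The bucketing and labeling of recommendation types described above can be illustrated as follows. This is a reconstruction for illustration, not the study's implementation, and the function names are hypothetical.

```python
def trust_level(median_rating: float) -> str:
    """Bucket a story's median trust rating into the three levels
    used in the study: low (0-3), medium (4-6), high (7-10)."""
    if median_rating <= 3:
        return "low"
    if median_rating <= 6:
        return "medium"
    return "high"

def recommendation_type(stories: list[float]) -> str:
    """Label a five-story ML output by the trust levels it combines."""
    levels = [trust_level(r) for r in stories]
    counts = {lvl: levels.count(lvl) for lvl in ("low", "medium", "high")}
    if counts["medium"] == 5:
        return "medium"                      # type a)
    if counts["low"] == 5:
        return "low"                         # type d)
    parts = []
    if counts["medium"]:
        parts.append("medium")
    if counts["low"]:
        parts.append(f"{counts['low']} low")
    if counts["high"]:
        parts.append(f"{counts['high']} high")
    return ", ".join(parts)                  # e.g. "medium, 1 low, 1 high"

print(recommendation_type([5, 4, 6, 5, 2]))  # -> "medium, 1 low" (type b)
```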
4 Results

In Phase 1, participants were presented with individual news stories, which they rated one at a time. The news stories came from 13 different news sources. Each participant rated 20 news stories (8 quality media, 4 tabloid, and 8 fake news stories). More than half (53.47%) of the trust ratings are low (between 0 and 3), 28.22% are medium (rated 4, 5, and 6), and 18.32% are high (7 to 10).

The first goal of this section is to establish whether our method and the trust ratings are valid. For this, we grouped the news stories by source and ranked them by their median trust rating (Table 1). The most trustworthy news source is a conservative newspaper of record with a median trust rating of 6.0 (N=256). The least trustworthy news source is a fake news blog with a median trust rating of 1.0 (N=129). Participants distinguish quality media (Sources A to F) from tabloid media and fake news blogs (G to M). There is one exception: Rank H is a quality media source, produced by the public-service television, which received a median trust rating of 4.0 and which is ranked between tabloid media and fake news. Unlike the other news sources, this median trust rating is only based on one article and 25 responses. The median ratings of all other news sources are based on four or more news articles and more than 100 ratings per news source (with a maximum of 258 ratings for 10 articles from source G). The fake news outlets are ranked as I (9), K (11), and M (13).

We validated the trust ratings of news items by comparing them to rankings of the news sources by nine media researchers (three male, six female), also shown in Table 1.
Table 1. Quality media sources (marked in green) are distinguished from tabloid media (yellow) and fake news sources (red) in the Participants' Ranking (N = 82, median trust rating) and the Media Researchers' Rankings (N = 9).

[Table 1: per rank, the participants' median trust ratings of the news sources next to the rankings by the individual media researchers, with sources grouped into quality media, tabloid media, and fake news sources.]
Unlike the vocational school students, the experts did not rate individual news stories but ranked the names of the news sources by their trustworthiness. With one exception, the researchers made the same distinction between quality media and biased media (fake news and tabloid media). Like our participants, the experts did not distinguish tabloid media from fake news blogs. Overall, the comparison of the two rankings shows that the trust ratings of the participants correspond to expert opinion. This validates the results through a sample different in expertise, age, and gender. The experts have a background in media research and two-thirds of the experts were female (which counterbalanced the male bias in the participants).
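The per-source ranking used in this validation can be computed in a few lines; a minimal sketch assuming the Phase 1 ratings sit in a pandas DataFrame with hypothetical column names and made-up values:

```python
import pandas as pd

# Hypothetical Phase 1 data: one row per (participant, story) rating.
ratings = pd.DataFrame({
    "source": ["A", "A", "B", "B", "C", "C"],
    "rating": [6, 5, 4, 4, 1, 2],
})

# Group the ratings by news source, take the median trust rating per
# source, and rank the sources from most to least trustworthy.
ranking = (ratings.groupby("source")["rating"]
           .median()
           .sort_values(ascending=False))
print(ranking)  # A: 5.5, B: 4.0, C: 1.5
```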
The first research question was whether users can provide trust ratings for recommendations of an algorithmic news curation system. We addressed this question with a between-subjects design where the samples are independent, i.e. different participants saw different news stories and news recommendations [32]. Participants provided their trust ratings for the news stories and the news recommendations on an 11-point rating scale. We analyzed this ordinal data using a non-parametric test, i.e. we made no assumptions about the distance between the different categories. To compare the different conditions and to see whether the trust ratings of the news recommendations differ in statistically significant ways, we applied the Mann-Whitney U test (Wilcoxon rank test) [34,32]. Like the t-test used for continuous variables, the Mann-Whitney U test provides a p-value that indicates whether statistical differences between ordinal variables exist.

[Figure 2: three histograms of the occurrences of each trust rating, comparing Low vs. Medium, Medium, 1 Low vs. Medium, and Medium, 1 Low, 1 High vs. Medium recommendations.]
Fig. 2. Histograms comparing the trust ratings of the different recommendations of the ML system.

Participants were told to rate ML recommendations. The framing of the experiment explicitly mentioned that they are not rating an ML system, but one recommendation of an ML system that consisted of five news stories. The results show that participants can differentiate between such ML recommendations. The ranking of the ML recommendations corresponds to the news items that make up the recommendations. Of the four types of news recommendations, a) Medium recommendations, which consist of five news stories with a trust rating between 4 and 6, have a median rating of 5.0. d) Low recommendations, with five news stories with a low rating (0 to 3), have a median trust rating of 3.0. The trust ratings of b) Medium, 1 Low recommendations, which combine four trustworthy stories and one untrustworthy story, are rated considerably higher (4.5). ML recommendations that consist of three trustworthy news items, one untrustworthy news item (rating between 0 and 3), and one highly trustworthy news story (7 to 10) received a median trust rating of 3.0.
Table 2. The Mann-Whitney U test was applied to see whether statistically significant differences between the trust ratings of different news recommendations exist (p-values below .05 indicate a significant difference).

Comparison                                    U         p
Medium vs. Low                                258.50    .0008
Medium vs. Medium, 1 Low                      303.50    .2491
Medium vs. Medium, 1 Low, 1 High              358.50    .0204
Medium, 1 Low vs. Low                         619.50    .0024
Medium, 1 Low vs. Medium, 1 Low, 1 High       801.50    .0618
Medium, 1 Low, 1 High vs. Low                1141.50    .0250
The second research question was whether users can distinguish trustworthy from untrustworthy machine learning recommendations. To answer this, we compare the trust ratings of a) Medium and d) Low recommendations. The trustworthy a) Medium recommendations have the same median rating (5.0) as the quality media sources D and E. Untrustworthy d) Low recommendations, with a median rating of 3.0, have the same rating as the tabloid news source J and the fake news source K. The Mann-Whitney U test shows that participants reliably distinguish between a) Medium and d) Low recommendations (U=258.5, p=.001). Figure 2 (left) shows the histogram of the a) Medium recommendations, which resembles a normal distribution. 5 is the most frequent trust rating, followed by 8 and 2. The histogram of d) Low is skewed towards negative ratings. Here, 1 and 3 are the most frequent trust ratings. Nevertheless, a large number of participants still gave a rating of 6 or higher for d) Low recommendations. A large fraction also gave a) Medium recommendations a rating lower than 5.
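Such a pairwise comparison can be reproduced with SciPy; a minimal sketch assuming two independent lists of 0-10 trust ratings (the values here are made up for illustration, not the study's data):

```python
from scipy.stats import mannwhitneyu

# Illustrative placeholder ratings: trust ratings (0-10) for a) Medium and
# d) Low recommendations from different (independent) participants.
medium_ratings = [5, 8, 2, 5, 6, 7, 5, 4, 5, 3]
low_ratings = [1, 3, 2, 1, 3, 6, 0, 3, 1, 2]

# Two-sided Mann-Whitney U test for independent ordinal samples.
u_statistic, p_value = mannwhitneyu(medium_ratings, low_ratings,
                                    alternative="two-sided")
print(f"U={u_statistic:.1f}, p={p_value:.4f}")
```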
The first two research questions showed that technically advanced participants are able to differentiate between trustworthy and untrustworthy ML recommendations in an experiment where they are primed to pay attention to individual fake news stories. The most important research question, however, was whether users distinguish trustworthy ML recommendations from recommendations that include one fake news story in their ratings. For this, we compare the trust ratings of a) Medium recommendations to those of b) Medium, 1 Low recommendations, which have a median trust rating of 4.5 (N=36). Compared to a) Medium at 5.0 (N=19), the median is slightly lower. Compared to the news sources, b) Medium, 1 Low at 4.5 is similar to quality media (Source F) and tabloid media (Source G). The Mann-Whitney U test shows that the ratings for b) Medium, 1 Low recommendations are significantly different from d) Low recommendations (U=619.5, p=.002). However, the difference between a) Medium and b) Medium, 1 Low is not statistically significant (U=303.5, p=.249). This means that the crucial fake news case, where a recommendation consists of four trustworthy news stories and one fake news story, is not distinguished in a statistically significant way. The histogram in Figure 2 (center) shows that a) Medium and b) Medium, 1 Low are very similar. Both resemble a normal distribution and both have strong peaks at 5, the neutral position of the 11-point rating scale. a) Medium recommendations have strong peaks at 2 and 7, b) Medium, 1 Low recommendations have peaks at 3 and 7. To see whether participants are able to distinguish the fake news case from other recommendations, we also compare b) Medium, 1 Low recommendations to c) Medium, 1 Low, 1 High recommendations, which consist of three trustworthy news stories (rated between 4 and 6), one highly trustworthy story (7 to 10), and one untrustworthy news item (0 to 3). The c) Medium, 1 Low, 1 High recommendations are rated at 3.0 (N=55). This is the same as d) Low recommendations (3.0). It is also much lower than the ratings of b) Medium, 1 Low recommendations (4.5). In comparison to the median trust rating of the news sources, this places c) Medium, 1 Low, 1 High recommendations at the level of the tabloid news source J and the fake news source K.
5 Discussion

The study found that participants with a technical background can provide plausible trust ratings for individual news items as well as for groups of news items presented as the recommendations of an ML system. The ratings of the news recommendations correspond to the news stories that are part of the news recommendations. We further showed that the trust ratings for individual news items correspond to expert opinion. Vocational school students and media researchers both distinguish news stories of quality media sources from biased sources. Neither experts nor participants placed the fake news sources at the end of the rankings. These findings are highly problematic considering the nature of fake news. Following Lazer et al.'s definition of fake news as fabricated information that mimics news media content in form but not in organizational process or intent [29], fake news are more likely to emulate tabloid media in form and content than quality media.

We found that users can provide trust ratings for an algorithmic news curation system when presented with recommendations of a machine learning system. Participants were able to assign trust ratings that differentiated between news recommendations in a statistically significant way, at least when comparing trustworthy to untrustworthy machine learning recommendations. However, the crucial fake news case was not distinguished from trustworthy recommendations. This is noteworthy since the first phase of our study showed that users are able to identify individual fake news stories. When providing trust ratings for groups of news items in the second phase, the presence of fake news did not affect the trust ratings of the output as a whole. This is surprising since prior research on trust in automation reliance implies that users' assessment of a system changes when the system makes mistakes [16]. Dzindolet et al. report that the consequences of this were so severe that after encountering a system that makes mistakes, participants distrusted even reliable aids. In our study, one fake news story did not affect the trust rating in such a drastic way. An untrustworthy fake news story did not lead to a very low trust rating for the news recommendation as a whole. The simplest explanation for this would be that the task is too hard for users. Identifying a lowly trusted news story in the recommendations of an algorithmic news curation system may overstrain users. An indication against this explanation is that trustworthy and untrustworthy recommendations can be distinguished from other news recommendations like the c) Medium, 1 Low, 1 High recommendations.

Our findings could, therefore, be a first indication that untrustworthy news stories benefit from appearing in a trustworthy context. Our findings are especially surprising considering that the users have an IT background and were primed to be suspicious. If users implicitly trust fake news that appear in a trustworthy context, this would have far-reaching consequences, especially since social media is becoming the primary news source for a large group of people [41]. The question whether untrustworthy news stories like fake news benefit from a trustworthy context is directly connected to research on algorithmic experience and the user awareness of algorithmic curation.

Our understanding of the user experience of machine learning systems is only emerging [22,18,47,19].
In the context of an online hotel rating platform, Eslami et al. found that users can detect algorithmic bias during their regular usage of a service and that this bias affects trust in the platform [19]. The question, therefore, is why participants did not react to the fake news stories in our study in a similar way. Further research has to show what role the context of our study, machine learning and algorithmic news curation, may have played. While framing effects are known to affect trust, our expectation was that the framing would have primed users to be overly cautious [33]. This would mean that participants can distinguish the recommendations in the experiment, but not in practice. This was not the case.

In the instructions of the controlled experiment, we defined the terms fake news and machine learning. This increased algorithmic awareness and the expectation of algorithmic bias. It could also have influenced the perception and actions of the participants by making them more cautious and distrusting. We show that despite this priming and framing, participants were not able to provide ratings that reflect the presence of fake news stories in the output. If people with a technical background and a task framed like this are unable to do this, how could a layperson? This is especially noteworthy considering that participants were able to distinguish uniformly trustworthy from uniformly untrustworthy output. All this makes the implications of our experiment for the UX of machine learning, and for how feedback and training data need to be collected, especially surprising and urgent. This adds to a large body of research on algorithmic experience and algorithmic awareness [17,58,11].
6 Limitations

Studying trust in machine learning systems for news curation is challenging. We had to simplify a complex socio-technical system. Our approach connects to a large body of research that applies trust ratings to study complex phenomena [40,39,33,44]. Since no ground truth data on the trustworthiness of different news stories was available, we designed a study that used the median trust ratings of our participants as intersubjective agreement on the perceived trustworthiness of a news story. A real-world algorithmic news curation system is more complex and judges the relevance of postings based on three factors: who posted it, the type of content, and the interactions with the post [20]. Even though we recreated the design of Facebook's News Feed, our setting was artificial. Interactions with the posts were limited, participants did not select the news sources themselves, and they did not see the likes, shares, and comments of their real Facebook "friends". We focused on news stories and did not personalize the recommendations of the ML system. Further research could investigate how the different sources affect the trust perception of news stories respectively the trust perception of ML recommendations. However, not personalizing the results and focusing on news was necessary to get comparable results.

We conducted the experiment in a German vocational school with an IT focus. This limits biasing factors like age, educational background, and socio-economic background, but led to a strong male bias. We counteracted this bias by validating the trust ratings of news stories with nine media research experts, a heterogeneous group that differs in age, gender (three male, six female), and background, which confirmed our results. Prior research also implies that the findings from our sample of participants are generalizable despite the strong male bias. A German study (N=1,011) from 2017 showed that age and gender have little influence on experience with fake news, which is similar for all people under 60, especially between 14-to-24-year-olds and 25-to-44-year-olds [28]. The participants in this study had a background in IT, which could have influenced the results. Prior work on algorithmically generated image captions showed that technical proficiency and education level do not influence trust ratings [33]. Moreover, even if the technical background of the participants had helped with the task, they were not able to provide nuanced ratings that account for untrustworthy news items, which further supports our arguments.
7 Conclusion

Our study investigated how fake news affect trust in the output of a machine learning system for news curation. Our results show that participants distinguish trustworthy from untrustworthy ML recommendations with significantly different trust ratings. Meanwhile, the crucial fake news case, where an individual fake news story appears among trustworthy news stories, is not distinguished from trustworthy ML recommendations. Since ML systems make a variety of errors that can be subtle, it is important to incorporate user feedback on the performance of the system. Our study shows that gathering such feedback is challenging. While participants are able to distinguish exclusively trustworthy from untrustworthy recommendations, they do not account for subtle but crucial differences like fake news. Our recommendation for those who want to apply machine learning is, therefore, to evaluate how well users can give feedback before training active learning and human-in-the-loop machine learning systems. Further work in other real-world scenarios is needed, especially since news recommendation systems are constantly changing.
References
1. Fachinformatiker: IT-Berufsausbildung auf dem Arbeitsmarkt sehr gefragt - Golem.de (2017)
2. Many Facebook users don't understand its news feed (2019)
3. Allcott, H., Gentzkow, M.: Social media and fake news in the 2016 election. Tech. rep., National Bureau of Economic Research (2017)
4. Allport, F.H., Lepkin, M.: Wartime rumors of waste and special privilege: why some people believe them. The Journal of Abnormal and Social Psychology 40(1), 3 (1945)
5. Allport, G.W., Postman, L.: The psychology of rumor (1947)
6. Alvarado, O., Waern, A.: Towards algorithmic experience: Initial efforts for social media contexts. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. pp. 286:1–286:12. CHI '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3173574.3173860
7. Amershi, S., Cakmak, M., Knox, W.B., Kulesza, T.: Power to the people: The role of humans in interactive machine learning. AI Magazine 35(4), 105–120 (2014)
8. Amershi, S., Cakmak, M., Knox, W.B., Kulesza, T.: Power to the people: The role of humans in interactive machine learning. AI Magazine 35(4), 105–120 (2014)
9. bento, Katharina Hölter, S.L.: Fake News in Deutschland: Diese Webseiten machen Stimmung gegen Merkel (2017)
10. Berkovsky, S., Taib, R., Conway, D.: How to recommend?: User trust factors in movie recommender systems. In: Proceedings of the 22nd International Conference on Intelligent User Interfaces. pp. 287–300. IUI '17, ACM, New York, NY, USA (2017). https://doi.org/10.1145/3025171.3025209
11. Binns, R., Van Kleek, M., Veale, M., Lyngs, U., Zhao, J., Shadbolt, N.: 'It's reducing a human being to a percentage': Perceptions of justice in algorithmic decisions. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. pp. 377:1–377:14. CHI '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3173574.3173951
12. Cosley, D., Lam, S.K., Albert, I., Konstan, J.A., Riedl, J.: Is seeing believing?: How recommender system interfaces affect users' opinions. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 585–592. CHI '03, ACM, New York, NY, USA (2003). https://doi.org/10.1145/642611.642713
13. Cramer, H.S., Evers, V., van Someren, M.W., Wielinga, B.J.: Awareness, training and trust in interaction with adaptive spam filters. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 909–912. CHI '09, ACM, New York, NY, USA (2009). https://doi.org/10.1145/1518701.1518839
14. Deutsch, M.: Trust, trustworthiness, and the F scale. The Journal of Abnormal and Social Psychology 61(1), 138 (1960)
15. Oxford Dictionaries: trust (2018), https://en.oxforddictionaries.com/definition/trust
16. Dzindolet, M.T., Peterson, S.A., Pomranky, R.A., Pierce, L.G., Beck, H.P.: The role of trust in automation reliance. International Journal of Human-Computer Studies 58(6), 697–718 (2003)
17. Eslami, M., Krishna Kumaran, S.R., Sandvig, C., Karahalios, K.: Communicating algorithmic process in online behavioral advertising. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. pp. 432:1–432:13. CHI '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3173574.3174006
18. Eslami, M., Rickman, A., Vaccaro, K., Aleyasen, A., Vuong, A., Karahalios, K., Hamilton, K., Sandvig, C.: "I always assumed that I wasn't really that close to [her]": Reasoning about invisible algorithms in news feeds. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. pp. 153–162. CHI '15, ACM, New York, NY, USA (2015). https://doi.org/10.1145/2702123.2702556
19. Eslami, M., Vaccaro, K., Karahalios, K., Hamilton, K.: "Be careful; things can be worse than they appear": Understanding biased algorithms and users' behavior around them in rating platforms. In: ICWSM. pp. 62–71 (2017)
20. Facebook: Facebook News Feed (2018), https://newsfeed.fb.com/
21. Gulla, J.A., Zhang, L., Liu, P., Özgöbek, Ö., Su, X.: The Adressa dataset for news recommendation. In: Proceedings of the International Conference on Web Intelligence. pp. 1042–1048. WI '17, ACM, New York, NY, USA (2017). https://doi.org/10.1145/3106426.3109436
22. Hamilton, K., Karahalios, K., Sandvig, C., Eslami, M.: A path to understanding the effects of algorithm awareness. In: CHI '14 Extended Abstracts on Human Factors in Computing Systems. pp. 631–642. CHI EA '14, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2559206.2578883
23. Herlocker, J.L., Konstan, J.A., Riedl, J.: Explaining collaborative filtering recommendations. In: Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work. pp. 241–250. ACM (2000)
24. Horvitz, E.J.: Reflections on challenges and promises of mixed-initiative interaction. AI Magazine 28(2), 3 (2007)
25. Huber, P.J.: Robust statistics. In: International Encyclopedia of Statistical Science, pp. 1248–1251. Springer (2011)
26. Kim, B.: Interactive and interpretable machine learning models for human machine collaboration. Ph.D. thesis, Massachusetts Institute of Technology (2015)
27. Kulesza, T., Burnett, M., Wong, W.K., Stumpf, S.: Principles of explanatory debugging to personalize interactive machine learning. In: Proceedings of the 20th International Conference on Intelligent User Interfaces. pp. 126–137. IUI '15, ACM, New York, NY, USA (2015). https://doi.org/10.1145/2678025.2701399
28. Landesanstalt für Medien NRW (LfM): Fake news. Tech. rep., forsa (May 2017), https://bit.ly/2ya2gj0
29. Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Metzger, M.J., Nyhan, B., Pennycook, G., Rothschild, D., et al.: The science of fake news. Science 359(6380), 1094–1096 (2018)
30. Lee, J.D., See, K.A.: Trust in automation: Designing for appropriate reliance. Human Factors: The Journal of the Human Factors and Ergonomics Society 46(1), 50–80 (2004)
31. Luhmann, N.: Trust and Power. John Wiley & Sons (1979)
32. MacKenzie, I.S.: Human-Computer Interaction: An Empirical Research Perspective. Morgan Kaufmann, Amsterdam (2013)
33. MacLeod, H., Bennett, C.L., Morris, M.R., Cutrell, E.: Understanding blind people's experiences with computer-generated captions of social media images. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp. 5988–5999. CHI '17, ACM, New York, NY, USA (2017). https://doi.org/10.1145/3025453.3025814
34. Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18(1), 50–60 (1947)
35. Marsh, S.P.: Formalising trust as a computational concept. Ph.D. thesis (1994)
36. Massa, P., Bhattacharjee, B.: Using trust in recommender systems: an experimental analysis. In: International Conference on Trust Management. pp. 221–235. Springer (2004)
37. Mayer, R.C., Davis, J.H., Schoorman, F.D.: An integrative model of organizational trust. The Academy of Management Review 20(3), 709–734 (1995). https://doi.org/10.2307/258792
38. Mitchell, T.M.: Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edn. (1997)
39. Muir, B.M., Moray, N.: Trust in automation. Part II. Experimental studies of trust and human intervention in a process control simulation. Ergonomics 39(3), 429–460 (Mar 1996). https://doi.org/10.1080/00140139608964474
40. Muir, B.M.: Trust in automation: Part I. Theoretical issues in the study of trust and human intervention in automated systems. Ergonomics 37(11), 1905–1922 (Nov 1994). https://doi.org/10.1080/00140139408964957
41. Newman, N., Fletcher, R., Kalogeropoulos, A., Levy, D.A., Nielsen, R.K.: Reuters Institute Digital News Report 2017 (2017), https://ssrn.com/abstract=3026082
42. O'Donovan, J., Smyth, B.: Trust in recommender systems. In: Proceedings of the 10th International Conference on Intelligent User Interfaces. pp. 167–174. ACM (2005)
43. OECD: PISA 2006 (2007)
44. Pennycook, G., Rand, D.G.: Crowdsourcing judgments of news source quality (2018)
45. Pu, P., Chen, L.: Trust building with explanation interfaces. In: Proceedings of the 11th International Conference on Intelligent User Interfaces. pp. 93–100. IUI '06, ACM, New York, NY, USA (2006). https://doi.org/10.1145/1111449.1111475
46. Rader, E., Cotter, K., Cho, J.: Explanations as mechanisms for supporting algorithmic transparency. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. pp. 103:1–103:13. CHI '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3173574.3173677
47. Rader, E., Gray, R.: Understanding user beliefs about algorithmic curation in the Facebook news feed. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. pp. 173–182. CHI '15, ACM, New York, NY, USA (2015). https://doi.org/10.1145/2702123.2702174
48. Reeskens, T., Hooghe, M.: Cross-cultural measurement equivalence of generalized trust. Evidence from the European Social Survey (2002 and 2004). Social Indicators Research 85(3), 515–532 (Feb 2008). https://doi.org/10.1007/s11205-007-9100-z
49. Rempel, J.K., Holmes, J.G., Zanna, M.P.: Trust in close relationships. Journal of Personality and Social Psychology 49(1), 95 (1985)
50. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An open architecture for collaborative filtering of netnews. In: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work. pp. 175–186. CSCW '94, ACM, New York, NY, USA (1994). https://doi.org/10.1145/192844.192905
51. Rotter, J.B.: A new scale for the measurement of interpersonal trust. Journal of Personality 35(4), 651–665 (Dec 1967). https://doi.org/10.1111/j.1467-6494.1967.tb01454.x
52. Rousseau, D.M., Sitkin, S.B., Burt, R.S., Camerer, C.: Not so different after all: A cross-discipline view of trust. Academy of Management Review 23(3), 393–404 (Jul 1998). https://doi.org/10.5465/AMR.1998.926617
53. Rubens, N., Elahi, M., Sugiyama, M., Kaplan, D.: Active learning in recommender systems. In: Recommender Systems Handbook, pp. 809–846. Springer (2015)
54. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (Dec 2015). https://doi.org/10.1007/s11263-015-0816-y
55. Schou, J., Farkas, J.: Algorithms, interfaces, and the circulation of information: Interrogating the epistemological challenges of Facebook. KOME: An International Journal of Pure Communication Inquiry 4(1), 36–49 (2016)
56. Stumpf, S., Rajaram, V., Li, L., Wong, W.K., Burnett, M., Dietterich, T., Sullivan, E., Herlocker, J.: Interacting meaningfully with machine learning systems: Three experiments. International Journal of Human-Computer Studies 67(8), 639–662 (2009). https://doi.org/10.1016/j.ijhcs.2009.03.004
57. Tullio, J., Dey, A.K., Chalecki, J., Fogarty, J.: How it works: a field study of non-technical users interacting with an intelligent system. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 31–40. ACM (2007)
58. Woodruff, A., Fox, S.E., Rousso-Schindler, S., Warshaw, J.: A qualitative exploration of perceptions of algorithmic fairness. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. pp. 656:1–656:14. CHI '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3173574.3174230
57. Tullio, J., Dey, A.K., Chalecki, J., Fogarty, J.: How it works: a field study ofnon-technical users interacting with an intelligent system. In: Proceedings of theSIGCHI Conference on Human Factors in Computing Systems. pp. 31–40. ACM(2007)58. Woodruff, A., Fox, S.E., Rousso-Schindler, S., Warshaw, J.: A qualitative explo-ration of perceptions of algorithmic fairness. In: Proceedings of the 2018 CHIConference on Human Factors in Computing Systems. pp. 656:1–656:14. CHI’18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3173574.3174230, http://doi.acm.org/10.1145/3173574.3174230http://doi.acm.org/10.1145/3173574.3174230