Prioritizing Original News on Facebook
Xiuyan Ni ([email protected]), Facebook Inc.
Shujian Bu ([email protected]), Facebook Inc.
Igor L. Markov ([email protected]), Facebook Inc.
ABSTRACT
This work outlines how we prioritize original news, a critical indicator of news quality. By examining the landscape and life-cycle of news posts on our social media platform, we identify challenges of building and deploying an originality score. We pursue an approach based on normalized PageRank values and three-step clustering, and refresh the score on an hourly basis to capture the dynamics of online news. We describe a near real-time system architecture, evaluate our methodology, and deploy it to production. Our empirical results validate individual components and show that prioritizing original news increases user engagement with news and improves proprietary cumulative metrics.
CCS CONCEPTS
• Information systems → Information retrieval; Data mining
• Computing methodologies → Machine learning
KEYWORDS
News, News Feed, Originality, PageRank, Clustering, Ranking
1 INTRODUCTION
Large amounts of news are published online every day, and many people now primarily consume news online [29]. News quality affects how people consume news and which platforms they prefer [13, 24]. Expressing news quality numerically can facilitate significant improvements for users and platforms [11]. Among various aspects of news quality, we focus on originality, which can be contrasted with duplicates, slightly edited text, and coverage that references original news. Producing original news is laborious and requires expertise, but such efforts initiate the typical news cycle and drive the entire news industry. Original news informs people around the world, from breaking news, eyewitness reports, and critical updates in times of crisis, to in-depth investigative reports that uncover new facts. Prioritizing original news online is in everyone's long-term interest [11].

In this work, we first explore the landscape of online news, using the Facebook platform as an example. To enable a quantitative approach, we tabulate the spectrum of news originality from completely unoriginal to highly original news. Our static analysis suggests that highly original news is rare, despite a large inventory which must be indexed and processed to accurately identify the original articles. We also explore the dynamics of the news life-cycle on Facebook and find that news posts typically attain the greatest exposure in the first couple of hours, followed by a long tail. This result suggests that an originality score used to improve News Feed ranking must be computed promptly.

Given two challenges (search quality at scale and fast response), we build a near real-time system and construct a synthesized signal for news originality.
Figure 1: Citations in news articles. The top snippet cites an article by another publisher. The cited article cites another article from the same publisher.

News articles that cover the same news event are clustered together based on specialized BERT embeddings [8], which are fine-tuned on pairwise-labeled data (same subject or different subjects). After evaluating several clustering algorithms against human-labeled pairwise data, we settle on a two-stage clustering algorithm that is both effective and highly scalable to large datasets. To adequately capture news dynamics, our system performs incremental updates on an hourly basis. We concluded that content alone is insufficient to judge news originality, but behavioral signals such as citations of prior posts can also be used.
Integrity considerations are particularly important, given the high incentives to game online news distribution. To de-bias our algorithms, we filter out news articles produced within patterns of nefarious activity. We first evaluate the performance of our originality signal offline against ratings by professional journalists. Online evaluation is based on an A/B test where we additionally monitor the impact on news article ranking [33]. The signal is incorporated in the News Feed ranking system. Our contributions include:
• We examine the news originality landscape and the dynamics of the news life-cycle, then propose a quantitative approach to reason about the news ecosystem. We categorize the level of news originality by the effort spent to generate news content.
• We propose a methodology and architect a near real-time system that processes individual news articles at a large scale. Using the PageRank algorithm and three-step clustering, it calculates a synthetic score to estimate news originality. PageRank normalization within clusters is particularly novel. The method can be applied to other news serving systems.
• To facilitate live-data analysis of perceived news quality and of news quality scores, we develop quantitative and qualitative methods. These methods can zoom in on individual news articles and their distribution, and also measure entire news ecosystems. Such analyses help both news publishers and consumers, who now depend on online news [11].

2 BACKGROUND
In this section, we first review the ideas behind PageRank and introduce the news citation graph. Then we outline ranking at Facebook, where we deploy our originality signal. However, other social media use conceptually similar ranking systems, and our contributions are not specific to Facebook.
The PageRank algorithm was originally developed at Google to rank Web pages and sites to improve search results [2, 4, 5, 26, 34]. Mathematically, it is a random-walk-based algorithm to rank vertices in a graph. A Web page with many incoming links from large-weight Web pages has a greater weight. Page weights are propagated from each Web page to the pages it links to. In the news domain, the work by Del Corso et al. [7] introduced a related graph-based ranking algorithm where each vertex represents a news source, focusing on authoritative news sources and interesting news events.

Ye and Skiena [34] built an automated ranking system called MediaRank to rank news sources. They applied the PageRank algorithm to news reporting citations to rank news sources and showed that PageRank values are positively related to reporting quality, as measured by peer reputation and other indicators. Zhang et al. [35] introduced a set of signals for indicating the credibility of news, collected from expert annotators. They grouped their indicators into two categories. The first group contains content indicators determined by the articles themselves: mentions of organizations, studies, etc. Context indicators in the other group require analysis of external sources, such as author reputation and/or recognition by peers in terms of the PageRank algorithm, as in Cresci et al. [4, 5].

Similar to citations in academic papers, it is common to cite credible peers in the news industry, and such citations are important indicators of news source quality. Therefore, we introduce the news citation graph at the news-article level, instead of the domain level, to estimate the credibility of individual news articles. The idea is that when a news article is disproportionately cited by its peers, this indicates higher journalistic credibility. Whereas academic papers itemize their references and use reference numbers in citations, news articles follow a different style.
In this work, we only consider citations in the form of links in a news article to other news articles. Figure 1 illustrates news article citations. The example at the top is from a news article by Publisher 1. This article cites multiple sources; one of them is shown: a news article from another publisher, which cites another article by the same publisher. If a publisher breaks the story about an important news event, many other articles and publishers will cite it.

We take snapshots of the news ecosystem and index all our notation by time t (Section 5.1). In particular, V is the set of all news articles at time t, and v ∈ V denotes an individual article. We cluster such articles by news event or news story (Section 5.2), denoting individual clusters C ⊂ V. When a news article v cites another article u, we represent this by a directed edge e_{v,u} ∈ E, where E is the set of edges in the citation graph. We also say that e_{v,u} is v's outbound edge and u's inbound edge. Using these directed edges, we can compute the PageRank values of individual vertices (Section 5.1) by iteratively applying the following formula to every vertex in the graph in a topological order:

    n_v = (1 − d)/|V| + d · Σ_{u ∈ B_v} n_u / |B_u|,    (1)

where n_v is the PageRank of article v (initialized to 1) at time t, B_v denotes the set of articles citing v (its inbound neighbors), |B_u| is the number of outbound edges of u, and d is a (constant) damping factor, usually set to 0.85. The latter parameter dampens the propagation of weights through multiple edges.

When estimating article originality, it is important to check how similar two articles are. Such checks are commonly implemented with cosine similarity on vector embeddings. To produce the necessary embeddings, prior work uses the BERT (Bidirectional Encoder Representations from Transformers) network architecture [8], which achieved state-of-the-art results in many natural language processing tasks across different applications [21, 27].
BERT handles previously unseen words by breaking them down into known subword fragments. It can also be updated on a regular basis to handle emerging keywords such as "COVID". The original BERT models were DNNs pre-trained on the BooksCorpus [37] and the English Wikipedia. However, BERT networks can be specialized to a given use case by adding one dense layer and training it on adequate labeled data. Along these lines, Reimers and Gurevych [28] proposed a Sentence-BERT architecture that uses the Siamese network structure in the context of semantic similarity estimation.
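To illustrate the subword mechanism, here is a toy sketch of WordPiece-style greedy longest-match splitting. The mini-vocabulary is hypothetical and far smaller than the roughly 30K-piece vocabularies real BERT models ship with; it only serves to show how an unseen word decomposes into known fragments.

```python
# Toy sketch of WordPiece-style greedy longest-match subword splitting.
# VOCAB is a hypothetical mini-vocabulary; "##" marks continuation pieces.
VOCAB = {"cov", "##id", "##er", "new", "##s", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:                 # try the longest candidate first
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand         # continuation marker
            if cand in vocab:
                match = cand
                break
            end -= 1
        if match is None:
            return ["[UNK]"]               # no decomposition found
        pieces.append(match)
        start = end
    return pieces

# An emerging keyword splits into known fragments instead of [UNK].
assert wordpiece("covid") == ["cov", "##id"]
assert wordpiece("news") == ["new", "##s"]
```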
The ranking of news has been extensively studied both in academia and industry [7, 17, 34]. A number of publications in the information retrieval community address this subject [6, 15, 19, 29, 31, 35, 36]. In 2018, Nuzzel announced a ranking system for news sources called NuzzelRank that integrates various signals, including publisher authority information, into a single score to rank news sources.

Facebook's News Feed ranks not only news content, but also events from users' social graph [20, 24]. Ranking objectives optimize long-term user satisfaction, account for communities (friends and family, etc.) [23] and News Feed integrity [16] (e.g., to discourage clickbait and prevent unlawful activities). When a user logs in to Facebook, they see their News Feed, which includes fresh updates from their friends, groups they joined, and pages they followed. News Feed ranking can be roughly divided into four stages: inventory, signals, prediction, and relevance scores [20, 24, 25]. Once a piece of content is posted, numerous signals are extracted: publication time, engagement counts, etc. Those signals are used to estimate the probabilities of possible individual user actions for each piece of content in the inventory, should they see it [20, 24, 25]. As a matter of notation, P(comment) represents the probability that a user comments on the update, while P(like) represents the probability that a user likes the content. At the last stage, we combine these predictions and compute a ranking relevance score for each piece of content. Our news originality signal is deployed within this system, summarized in Figure 2. News Feed ranking at Facebook incorporates many signals, and our originality signal introduces only subtle changes to the user experience, as we explain later.
https://nuzzel.com/rank

Figure 2: News Feed ranking at Facebook
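The prediction-and-relevance stages can be sketched as a weighted combination of predicted action probabilities. The weights and posts below are hypothetical illustrations, not production values; real weights are tuned to maximize long-term user satisfaction.

```python
# Illustrative sketch of the final ranking stage: combine predicted
# per-action probabilities into one relevance score per post.
# WEIGHTS is hypothetical; production weights are tuned separately.
WEIGHTS = {"comment": 3.0, "share": 2.0, "like": 1.0}

def relevance(predictions):
    # Weighted sum over predicted action probabilities.
    return sum(WEIGHTS[action] * p for action, p in predictions.items())

posts = {
    "post_1": {"comment": 0.10, "share": 0.05, "like": 0.60},
    "post_2": {"comment": 0.02, "share": 0.01, "like": 0.90},
}
ranked = sorted(posts, key=lambda p: relevance(posts[p]), reverse=True)
assert ranked == ["post_1", "post_2"]          # comments weigh more than likes
assert abs(relevance(posts["post_1"]) - 1.0) < 1e-9
```

Although post_2 is more likely to be liked, post_1 ranks higher because heavier-weighted actions (comments, shares) dominate the combined score.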
Here we examine the news originality landscape and motivate our work. Then we investigate the life-cycle of news stories on social media platforms. Understanding the news life-cycle is critical to deploying the originality signal within News Feed ranking.
Our quantitative approach to news originality uses content buckets:
a) completely unoriginal: scraped or spun content with no editorial effort
b) highly unoriginal: very low editorial effort
c) somewhat unoriginal: may be editorially produced but heavily cites other content without original reporting or analysis
d) potentially original but lacking peer recognition
e) recognized as original by peers: breaking news, eyewitness reports, exclusive scoops, investigative reporting, etc.

Scraped content is copied from other sources without editorial effort.
Spun content is taken from a post or a Web page and posted with only minor modifications by humans or machines (see examples in Table 1). Common methods include paraphrasing, replacing words, and reordering paragraphs. By automating the spinning of existing content, one can quickly produce a large amount of content without scraping. Scraped and spun content can eclipse original content and undermine its value, which warrants removal or limited distribution compared to original content.
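Spun text that survives isolated word swaps can be caught with simple overlap metrics. Below is a minimal sketch using word 3-gram shingles and Jaccard similarity on headlines in the style of Table 1; this is an illustration of the idea, not the production hashing/fingerprinting system.

```python
# Illustrative sketch (not the production system): word 3-gram shingles
# plus Jaccard similarity flag spun text that only swaps isolated words.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

original = "pod foods gets vc backing to reinvent grocery distribution"
spun = "pod meals will get vc backing to reinvent grocery distribution"
unrelated = "israel grants rashida tlaib west bank visit on humanitarian grounds"

assert jaccard(original, spun) > 0.3        # heavy shingle overlap survives spinning
assert jaccard(original, unrelated) == 0.0  # unrelated headlines share no shingles
```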
Highly unoriginal articles are produced by low-effort text changes. We find that most news articles actually fall into the third bucket, somewhat unoriginal. These articles may provide useful information, but do not require much effort to produce.

Figure 3: News originality by bucket: (a) completely unoriginal; (b) highly unoriginal; (c) somewhat unoriginal; (d) potentially original but lacking peer recognition; (e) recognized as original by peers. For each bucket, we show estimated total views received by all news articles; about 70% of views are covered by the citation graph.
Potentially original but lacking peer recognition: this bucket includes content that does not fit in earlier buckets and so may be original, but for various reasons does not receive peer recognition throughout the news cycle. Opinion pieces that receive little support often fall into this category. Thus, citation signals alone cannot distinguish between this bucket and unoriginal articles.

Highly original news is produced with significant effort to fact-check information and produce clear narratives, high-quality writing, and visuals. Thoughtful and original news content is usually cited heavily by industry peers and contributes to the reputation of individual content creators. Due to the effort and expertise required, original news content is scarce. Prioritizing the distribution of original content can help it reach greater audiences and benefit both readers and the news industry in the long run [11].

In general, it is difficult to judge each article for originality in isolation because this would require careful analysis of content with an understanding of current events. Particularly challenging would be to distinguish rumors and fake news from reasonable reporting. Therefore, we draw additional insights from the news citation graph and the dynamics of online news. The special cases of scraped and spun content are handled by dedicated systems based on text hashing and fingerprinting, as well as text similarity metrics. In practice, such content does not appear in users' News Feed inventory and is therefore not treated in our work.
Table 1: Examples of spun content. Publisher 1 posted original articles, while Publisher 2 replaced isolated words, phrases, and sentences in articles from Publisher 1.

Publisher 1 (Original): "Israel grants Rashida Tlaib West Bank visit on humanitarian grounds"
Publisher 2 (Spun): "Israel grants Rashida Tlaib West Financial Institution go to humanitarian grounds"

Publisher 1 (Original): "Israel's interior minister on Friday said"
Publisher 2 (Spun): "Israel's inside minister on Friday said"

Publisher 1 (Original): "Pod Foods gets VC backing to reinvent grocery distribution"
Publisher 2 (Spun): "Pod Meals will get VC backing to reinvent grocery distribution"

News content published on the Internet can be easily indexed and archived, but social media platforms tend to favor fresh news. That is why news reporters strive to break a new story. To re-examine this conventional wisdom and determine how to reflect it in our work, we explore a large volume of news articles shared on Facebook and track the dynamics of user engagement metrics. We also visualize the life-cycle of typical online news stories and check the impact of adding valuable information days after the original publication. As it turns out, the same pattern persists across different news categories: world and local news, politics, and entertainment news.

Figure 4 illustrates how quickly users lose interest in a particular story. On September 27, 2019, Disney and Sony reached a deal for Spiderman movies, announcing that Spiderman would stay in the Marvel Universe. One publisher reported the story first. Almost 800 websites covered the news on the exact same day. On the second day, the engagement metrics of this story dropped significantly and eventually vanished on September 29, in just three days.

Figure 5 shows that adding information at a later time does not help gain traffic. On November 11, 2019, the Ebola vaccine by Johnson & Johnson was approved. Our inventory showed that 17 websites published 34 related articles on that day, and user engagement metrics hit a peak. The news was first reported by a publisher who focuses on life science and medicine, which gained most of the traffic. Two days later, on November 13, the World Health Organization officially approved the vaccine. Many mainstream publishers covered this news, and we observed an inventory increase. However, this did not stimulate another engagement peak: traffic was mostly flat and almost vanished after seven days.

Our data analysis suggests that ranking interventions can only be effective early in the life-cycle of a news story.
This directly impacts the architecture and implementation of News Feed ranking, posing challenges to both signal computation and ranking deployment. Therefore, we focus only on news articles published within the last seven days. Our originality ranking intervention does not dramatically change users' News Feed experience because we do not alter the existing inventory of posts. However, the aggregated effect should reward publishers and reporters who produce thoughtful and original content.
Figure 4: The life-cycle of a Spiderman story
Intuitively, news originality refers to the process by which news content is created, as well as the quality of the news content. However, capturing these notions computationally appears challenging, especially when the content creation process remains opaque. Professional journalists and raters often find isolated text insufficient to rate originality and need additional context. Useful context includes ongoing news events and how much coverage they enjoyed, as well as how a given news article is perceived by peers in the news ecosystem. A major precept in our work is that direct content analysis is neither sufficient nor necessary, whereas adequate context may provide sufficient signals to estimate originality.

To capture the context of individual news articles, we construct a news citation graph (Section 2.1) for the entire news inventory at a fixed time. Peer recognition of each article is evaluated using the PageRank algorithm on this graph. An original piece of news could be cited by different publishers; it could also be a local news story cited by a major publisher with many subsequent citations; both cases are captured adequately by PageRank. Here we emphasize the use of global PageRank values not restricted to particular news events. That is because quality articles often cite out-of-topic background material and may be cited under later news events.
Figure 5: The life-cycle of the J&J Ebola vaccine story
Figure 6: The workflow of our methodology.
We try to capture news ecosystem dynamics and emulate how professional raters or journalists estimate the news originality level. To this end, PageRank values cannot be compared across topics and news events with very different amounts of news coverage. For a given news event or news story, we consider the entire news coverage as a cluster. Our insight is that articles with the highest global PageRank values within each news-event cluster are most likely to be original. Hence, we estimate news originality by normalizing global PageRank scores n_v within each cluster C_v:

    s_v = n_v^p / Σ_{u ∈ C_v} n_u^p,    p > 1,    (2)

where C_v is the cluster of article v and p is a constant exponent; increasing p favors articles with higher n_v values. Our process of estimating news originality is shown in Figure 6. Notably, we cannot evaluate a newly published article for originality before peers cite it. This introduces a delay and requires a near real-time system to deliver originality scores early in the news cycle.

When using originality scores s_v in News Feed ranking, we first convert them into P(original) ∈ [0, 1] as follows:

    P(original) = (max(s_v, θ) − θ) / (1 − θ).    (3)

Here θ ∈ (0, 1) is the promotion threshold, i.e., only content with s_v > θ can be promoted. Then, we add P(original) to the relevance score as a second-order term:

    Relevance = α_1 · P(comment) + α_2 · P(share) + α_3 · P(like) + · · · + α_n · P(click) · P(original).    (4)

Here P(comment), P(share), P(like), and P(click) are probabilities of respective events for the news article in question, and the weights α_i maximize long-term user satisfaction. Clearly, our originality signal is just one component of News Feed ranking that elevates peer-recognized content. Other signals elevate other content types. Our preliminary investigation found that news articles highly cited by other articles tend to exhibit a higher level of originality.
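The cluster normalization and thresholding of Eqs. (2) and (3) can be sketched as follows. The values of p and θ below are illustrative defaults, not the production settings.

```python
# Sketch of Eqs. (2)-(3): normalize global PageRank within one cluster,
# then map scores above a threshold theta into P(original).
# p=2.0 and theta=0.5 are illustrative, not production values.
def originality_scores(cluster_ranks, p=2.0):
    total = sum(n ** p for n in cluster_ranks.values())
    return {v: (n ** p) / total for v, n in cluster_ranks.items()}

def p_original(s_v, theta=0.5):
    return (max(s_v, theta) - theta) / (1.0 - theta)

# One news-event cluster: article "a" has the highest global PageRank.
scores = originality_scores({"a": 3.0, "b": 1.0, "c": 1.0})
assert scores["a"] > 0.8                          # 9 / 11 of the cluster mass
assert abs(sum(scores.values()) - 1.0) < 1e-9     # scores normalize within the cluster
assert p_original(scores["b"]) == 0.0             # below threshold: not promoted
assert p_original(1.0) == 1.0                     # maximal score maps to 1
```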
Therefore, we first build a citation graph of all news articles published in a seven-day window. Then, we calculate global PageRank values for individual articles, cluster news articles by news event/story in a scalable way, and normalize PageRank values within each cluster.
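The global PageRank step of this pipeline (Eq. 1) can be sketched on a toy citation graph. This is a minimal illustration without dangling-node or convergence handling; edge (v, u) means article v cites article u, so weight flows from citing to cited articles.

```python
# Minimal sketch of damped PageRank on a small directed citation graph.
# Edge (v, u) means article v cites article u.
def pagerank(edges, d=0.85, iters=50):
    nodes = {x for e in edges for x in e}
    out_deg = {v: 0 for v in nodes}
    inbound = {v: [] for v in nodes}
    for v, u in edges:
        out_deg[v] += 1
        inbound[u].append(v)
    n = {v: 1.0 for v in nodes}            # initialize every article to 1
    for _ in range(iters):
        n = {
            v: (1 - d) / len(nodes)
               + d * sum(n[u] / out_deg[u] for u in inbound[v])
            for v in nodes
        }
    return n

# Toy graph: B and C both cite A; C also cites B.
ranks = pagerank([("B", "A"), ("C", "A"), ("C", "B")])
assert ranks["A"] > ranks["B"] > ranks["C"]   # the most-cited article ranks highest
```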
We index all the news articles shared on the platform by leveraging the Facebook Crawler tool. The Facebook Crawler crawls the HTML of an app or website that was shared on Facebook via copying and pasting the link or by a Facebook social plugin. Other open-source crawlers serve the same purpose; Common Crawl is a well-maintained open repository of Web crawl data that can be accessed and analyzed by anyone.

We limit news articles in the graph to those posted within a seven-day moving window. After parsing the HTML, we traverse the output to get all <a> tags, which define hyperlinks to other Web pages. Hyperlinks specified in the <a> tag with different URLs may point to the same Web page. Therefore, we resolve all URLs to canonical URLs and assign each news citation graph vertex a unique ID based on a canonical URL. If the cited Web page is also a recent news article, we establish an edge between the two news article vertices. With this news citation graph, we compute PageRank values for each news article.

The raw citation graph is vulnerable to link farming, as per Du et al. [9]. That is, the graph may be manipulated by changing the interconnected link structure of pages to add many inbound edges to a target page. To counter such manipulations, we disregard several types of citations before applying the PageRank algorithm. As shown in Figure 1, one typical example is self-linking edges in G_t that cite an article published by the same publisher. Some Web sites link their articles to other Web sites without real content, but with automatic redirects to phishing sites, or simply return to the citing article. These integrity filters mitigate the risk of manipulation.

A filtered citation graph snapshot at each hour typically contains 300K–500K edges. News articles without incoming or outgoing citations are excluded from the PageRank computation.
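The graph-construction steps above can be sketched as follows. Here canonicalize() is a toy stand-in for real canonical-URL resolution, and the filter drops same-publisher (self-linking) edges as one of the integrity measures described; the URLs are hypothetical examples.

```python
# Sketch: extract citation edges from an article's HTML, canonicalize
# URLs, and drop same-publisher (self-linking) edges before PageRank.
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":                      # <a> tags define hyperlinks
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def canonicalize(url):
    # Toy canonical form: lowercase host plus path without a trailing slash.
    u = urlparse(url)
    return u.netloc.lower() + u.path.rstrip("/")

def citation_edges(article_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    src = canonicalize(article_url)
    src_domain = src.split("/")[0]
    edges = set()
    for link in parser.links:
        dst = canonicalize(link)
        if dst.split("/")[0] != src_domain:  # integrity filter: drop self-links
            edges.add((src, dst))
    return edges

html = ('<p><a href="https://other.example/story">scoop</a>'
        '<a href="https://self.example/old-post/">ours</a></p>')
edges = citation_edges("https://self.example/new-post", html)
assert edges == {("self.example/new-post", "other.example/story")}
```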
Despite their long history, attempts to manipulate PageRank in Web search have been successfully addressed [14]. The original PageRank calculations work well with graphs that exhibit cycles, created when popular Web pages are revised to link to pages published later. Unlike the Web link graph, our news citation graph mostly contains links to past content, since news posts on social networks are typically not revised. PageRank calculations simplify significantly on acyclic graphs and require a single linear-time graph traversal. However, in practice our citation graph contains enough cycles to question such simplifications.
(Footnotes: https://developers.facebook.com/docs/sharing/webmasters/crawler ; https://commoncrawl.org/ ; https://developers.facebook.com/docs/sharing/webmasters/getting-started/versioned-link )
We now outline our clustering technique. As explained in Section 4, we normalize PageRank scores for individual news articles using PageRank scores of other articles in the same cluster. Intuitively, an important national news event and a local breaking news story might carry similar amounts of originality, but original articles in a larger cluster get more citations and higher PageRank scores. In addition to cluster normalization, computational scalability is also important: on an uneventful day, our inventory snapshot contains 2M–3M articles, and we strive to process them in minutes.
We estimate the topical similarity of articles based on their titles, noting that articles with identical titles may have different PageRank scores. We first lowercase article titles, remove punctuation, and hash the titles to assemble duplicates into mini-clusters. For each unique title, we calculate a vector embedding based on the powerful and adaptable BERT DNN (Section 2.2). In addition to handling synonyms and equivalent phrases well, BERT also supports transfer learning. To this end, we use a Siamese-twins network architecture shown in Figure 7, previously proposed for semantic similarity estimation [28].
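The deduplication step above — lowercase, strip punctuation, hash identical titles into mini-clusters — can be sketched as below; the exact normalization and hash function used in production may differ.

```python
import hashlib
import string

def title_key(title):
    """Lowercase, strip punctuation, collapse whitespace, and hash a title
    so that exact duplicates collapse before any embedding is computed."""
    t = title.lower().translate(str.maketrans("", "", string.punctuation))
    t = " ".join(t.split())
    return hashlib.sha1(t.encode("utf-8")).hexdigest()

def mini_clusters(titles):
    """Group titles whose normalized forms hash to the same key."""
    groups = {}
    for title in titles:
        groups.setdefault(title_key(title), []).append(title)
    return list(groups.values())
```

Only one embedding per mini-cluster then needs to be computed, which matters at the 2M–3M article scale mentioned above.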
The two article titles are processed by the two constituent BERT models, which we implement in PyTorch using HuggingFace transformers [32]. An additional layer on top of BERT is a 128-dimensional fully connected (FC) layer with tanh activation. In Figure 7, T_i represents the i-th token in the input sentences. With the BERT network weights fixed, the top level is trained on labeled article pairs using the cosine embedding loss function

    L(x_1, x_2, y) = { 1 − cos(x_1, x_2),                  if y = 1
                     { max(0, cos(x_1, x_2) − margin),     if y = −1        (5)

where x_1 and x_2 represent the two input sentences, respectively. Here y = 1 means the two sentences describe the same news event, while y = −1 means the two sentences are about completely different news events.

[Figure content: News title 1 → BERT → Dense Layer; News title 2 → BERT → Dense Layer; the two outputs are compared by a Cosine Loss.]
Figure 7: Estimating sentence similarity using pre-trained BERT networks [28]. The shared dense layer is trained.
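For a single pair of title embeddings, Equation 5 behaves as in this plain-Python sketch; it mirrors the semantics of torch.nn.CosineEmbeddingLoss, which a PyTorch implementation would use directly.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cosine_embedding_loss(x1, x2, y, margin=0.0):
    """Eq. 5: pull same-event title embeddings together (y = 1),
    push different-event embeddings below `margin` apart (y = -1)."""
    c = cosine(x1, x2)
    return 1.0 - c if y == 1 else max(0.0, c - margin)
```

Identical embeddings with y = 1 incur zero loss; identical embeddings with y = −1 incur the maximal loss, pushing the dense layer to separate them.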
BERT-based vector embeddings, optimized to capture title similarity via cosine similarity, support vector-based clustering algorithms. The choice of algorithms is driven by quality considerations and by the ability to process millions of titles in several minutes, which we need to ensure frequent refresh of the news originality signal (in the context of Section 3.2). Clustering algorithms based on K-Nearest-Neighbors (KNN) are a natural starting point, but specifying K is not straightforward, and for any given K such algorithms risk producing inconsistent results in our application. Therefore, our three-step clustering in Figure 8 combines text hashing and KNN with greedy local search. Topical clusters often contain just a few different titles, while national news receives up to thousands of citations per article.
The set of unique article vectors is converted into an undirected KNN graph G. For each vector, we find its K nearest neighbors based on cosine similarity (1 − cosine distance) and use cosine similarity as the edge weight between adjacent vertices v_ti and v_tj. Lightweight edges are ignored, and subgraphs are defined by connected components of the resulting graph. Reasonable weight thresholds are found with a form of binary search guided by a subgraph size target; see details in Algorithm 1.
An investigation of typical outputs of Algorithm 1 suggested that clusters were generally reasonable, but local news and events with low coverage were not handled well. To remedy this deficiency, we form local clusters using greedy optimization to maximize the total edge weight w_c inside clusters. We impart a default negative weight ω to pairs of vertices within a top-down cluster that are not connected by edges (not nearest neighbors). The smaller the ω, the harder it is to create subclusters. For details, see Algorithm 2.

Example 5.1.
Figure 8 illustrates local clusters in a subgraph: {A, B, C, D} and {E, F}. Suppose ω takes a small negative value. Then the total edge weight of cluster 1 is the sum of its five edge weights plus ω for the missing edge between A and C, and the total weight of cluster 2 is its single edge weight. Although A and E are connected, the edge weight is so low that adding E would not increase the total weight of cluster 1. The same reasoning applies to F. Therefore, local clustering produces two clusters.

Building and processing the KNN graph with K nearest neighbors per vertex is a major performance bottleneck. On a typical day, all news articles from the last week fit in the RAM of a single server and can be processed reasonably quickly. However, this architecture is insufficiently scalable for the following reasons.

Figure 8: Three-step clustering
Algorithm 1: Split a graph into subgraphs with target size

Input: Weighted graph G = {V, E}, subgraph target size t, optimization threshold ε, ℓ = 0.0, h = 1.0
Output: A set of subgraphs S of approximately target size t

Function findSubgraphs(G, ε, ℓ, h):
    S = ∅
    while h − ℓ > ε do
        m = (ℓ + h) / 2
        G′ = G without edges of weight < m
        C = connectedComponents(G′)
        foreach c ∈ C do
            if |c| > t then
                Remove vertices in c and their incident edges from G
                S = S ∪ findSubgraphs(c, ε, ℓ, m)
            end
        end
        h = m
    end
    G′ = G without edges of weight < m
    S = S ∪ connectedComponents(G′)
    return S
end

Algorithm 2: Greedy local clustering
Input: Weighted graph g, negative weight ω for missing edges, number R of independent randomized passes
Output: An integer c_v for each vertex v (cluster assignment)

repeat R times
    Randomize the order of vertices in g
    Initialize each vertex v in its own cluster c_v
    repeat
        foreach v ∈ g do
            foreach u ∈ B_v do
                Try moving v from cluster c_v to cluster c_u
                Add up internal weights for c_u and c_v
                Record u with the highest sum of weights seen
            end
            Move v to maximize the sum of weights of c_v and c_u
        end
    until Σ w_c no longer increases
    Record the solution with the highest Σ w_c seen
end

• Potential surges of the news inventory during the election season, New Year's Eve, etc.
• Near real-time processing benefits from additional compute resources (lower processing latency via multiple servers).
• Need for scaling to a larger content inventory. The challenge we are solving and our methods are fairly general, so they can be applied to other social-network platforms that value originality. Now or in the future, such platforms may enjoy a much larger scale of content inventory.

Table 2: Guidelines for rating the similarity of article pairs

Score  Rating                                 Criteria
0.0    different subjects                     the two articles cover completely different subjects
1.0    different subjects / some commonality  the two articles cover different subjects but share some content
2.0    same subject / different aspects       the two articles cover the same subject but report different aspects of the same story
3.0    same subject                           the two articles cover the same subject

The overall design described in Section 5.2 naturally supports distributed processing to ensure greater overall scalability and robustness to surges. In fact, this is why Algorithm 1 performs balanced partitioning. Our implementation supports distributed clustering as well. We found that the upper bound on single-server capacity is an important parameter: individual servers must receive a sufficient amount of work to justify distributed processing, but the data must fit into available RAM. Between the implied lower and upper bounds, there is a transition point where one can reduce the amount of computation at the cost of greater processing latency.
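Algorithm 2's greedy local search can be sketched in Python as follows. This is a deliberately unoptimized sketch: it accepts any improving single-vertex move rather than always the best one, and ω, the pass count, and the seed are illustrative defaults rather than production values.

```python
import random

def total_weight(assign, weights, omega):
    """Objective: sum of edge weights inside clusters; vertex pairs that
    share a cluster without being KNN neighbors contribute omega (< 0)."""
    clusters = {}
    for v, c in assign.items():
        clusters.setdefault(c, []).append(v)
    total = 0.0
    for members in clusters.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = members[i], members[j]
                total += weights.get((a, b), weights.get((b, a), omega))
    return total

def greedy_local_clusters(weights, omega=-0.2, passes=3, seed=0):
    """weights: {(u, v): similarity} for KNN-graph edges, each pair once.
    Returns {vertex: cluster_id}, keeping the best of `passes` runs."""
    rng = random.Random(seed)
    nodes = sorted({n for e in weights for n in e})
    best, best_score = None, float("-inf")
    for _ in range(passes):
        assign = {v: v for v in nodes}  # start with singleton clusters
        order = nodes[:]
        rng.shuffle(order)
        improved = True
        while improved:
            improved = False
            for v in order:
                neighbors = [a if b == v else b
                             for (a, b) in weights if v in (a, b)]
                for u in neighbors:  # try joining each neighbor's cluster
                    old = assign[v]
                    if assign[u] == old:
                        continue
                    before = total_weight(assign, weights, omega)
                    assign[v] = assign[u]
                    if total_weight(assign, weights, omega) > before:
                        improved = True
                    else:
                        assign[v] = old
        score = total_weight(assign, weights, omega)
        if score > best_score:
            best, best_score = dict(assign), score
    return best
```

On a toy graph mimicking Example 5.1 — strong edges within {A, B, C} and {E, F} and one weak A–E edge — the penalty ω keeps the weakly linked vertices out of the large cluster, yielding two local clusters.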
Before deploying our news originality signal to production at Facebook, we evaluate its functional components individually, evaluate the entire signal with the help of professional raters, then embed the signal into News Feed ranking and explore examples to check that everything works as expected. The production deployment is evaluated with an industry-standard technique: an A/B test on live data for a limited subset of users before it is enabled for the main group of users [33].
In our rating flow, we ask professional raters to review pairs of news articles. The raters assign a similarity level to each pair of articles: different subjects; different subjects but some common content; same subject with different aspects; and same subject (the four levels are explained in Table 2). For training, we collect 100K pairs of randomly sampled English news titles, using 40% for finetuning, 10% for validation, and 50% for test. Separately, we collect another 10K pairs of news articles to evaluate clustering performance. To sample likely-positive examples, we take some number of closest neighbors in terms of document embeddings and/or text similarity. Likely-negative samples are drawn from further-away neighbors that are sufficiently close to make the labeling task nontrivial.
To compare our vector embeddings with FastText [18] and Pytorch-BigGraph [22] embeddings, we represent similarity levels numerically by 0.0, 1.0, 2.0, 3.0 during training, following Table 2. During evaluation, we binarize model scores at thresholds 0.5, 1.5 and 2.5, then use ROC AUC as the evaluation metric. For example, AUC@2.5 labels article pairs with cosine similarity ≥ 2.5 as positive. Table 3 describes the performance of our BERTPairwise model, which consistently outperforms pre-trained state-of-the-art embeddings.
To evaluate our news-event clustering against human labels, we randomly sample 10K pairs of news articles in English from the candidate pool and send the pairs to professional annotators, along with the guidelines in Table 2. Then, we apply the clustering algorithms to the entire candidate pool. For each sampled pair, if the two articles appear in the same cluster, the predicted label is positive; otherwise, negative. The clustering algorithm is evaluated by precision and recall, then compared with two well-known algorithms in Table 4. DBSCAN (density-based spatial clustering of applications with noise) [10, 30] is a highly scalable density-based algorithm. The Louvain algorithm [1] is one of the fastest and best-known community detection algorithms for large networks.
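The pair-based comparison against annotator labels described above can be sketched as follows; the function and variable names are illustrative.

```python
def pairwise_precision_recall(pairs, same_cluster, gold_same):
    """pairs: sampled (a, b) article pairs; same_cluster(a, b) -> bool is
    the clustering's prediction; gold_same holds pairs that annotators
    judged to be the same news event."""
    tp = fp = fn = 0
    for a, b in pairs:
        pred = same_cluster(a, b)
        gold = (a, b) in gold_same
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The same routine applies unchanged to DBSCAN, Louvain, or the three-step clustering, since each only needs to answer whether two articles share a cluster.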
To assess the accuracy of our citation score signal, we sample the most viewed news articles identified as original, and the most viewed articles not identified as original, from the most viewed news domains over a seven-day period. Our professional raters have many years of news-industry experience and follow a deliberate process to ensure a fair judgement for each article they rate on a three-point scale of news originality (Table 5). For the rating 3.0, our predicted labels match these results 90% of the time. In other words, our signal attains 90% accuracy in identifying original news.
Besides the quantitative evaluation, we also performed qualitative case studies. Here we describe one example that illustrates how our system works. On January 26, 2020, an article n about the death of Kobe Bryant in a Calabasas helicopter crash was first reported by the publisher TMZ. In just 10 minutes, many publishers covered this story and cited TMZ. Over 200 articles fell into this news-event cluster, and the original story by TMZ ranked the highest. For such events, users would see news articles posted by the news pages they follow and shared by their friends. If the original news article is in a user's feed inventory, it gets prioritized. Note that our originality signal is only one component in the ranking formula. Users with preferences for certain publishers or strong affinity with their friends continue seeing articles shared by those actors.

Table 3: The pairwise embedding vs. FastText [18] and Pytorch-BigGraph [22] embeddings
Model  AUC@0.5  AUC@1.5  AUC@2.5

Table 4: The performance of three-stage clustering with DBSCAN [10] and the Louvain algorithm [1]
Algorithm               Precision  Recall
DBSCAN                  43.07      73.04
Louvain                 81.01      47.57
Stage 1 + Louvain       81.85      32.63
Three-stage clustering
Table 5: Guidelines for rating news originality

Score  Rating                         Criteria
1.0    unoriginal                     borrows most of the content and language from other sources, or is extremely thin / low information overall, and anything that is not properly syndicated
2.0    possibly/somewhat unoriginal   rewords borrowed content with its own language, but >70% is borrowed OR properly syndicated
3.0    fully original                 is not a syndicated republishing; little to no content is borrowed
The originality signal is intended for the relevance score calculation (see Figure 2 and Equation 4) to increase the distribution of original news articles. To ensure its availability early in the news cycle, it is recalculated from scratch on an hourly basis. Building the news citation graph and news clusters takes only a few minutes, but system bottlenecks are observed in our current crawling infrastructure and in generating vector embeddings. In practice, it takes time for original articles to get cited, but running the workflow more often could find and promote original articles earlier. Such improvements are likely with further infrastructure optimization.
Before making the proposed changes to News Feed ranking at Facebook, we consulted with the academic and publishing communities and performed careful empirical evaluation. In particular, we ran an A/B test on live data for several weeks, where the control group used prior production ranking rules and a small test group used revised ranking rules [33]. To estimate impact, we computed the increase in view counts at different thresholds (Table 6) and found that our technique works well at different thresholds. We have not observed statistically significant deteriorations in our proprietary metrics [11, 20, 23] during the A/B test or after the subsequent full product launch. After additional checks and consultations, our signal was enabled for English-language content within Facebook's News Feed ranking system for most users in June 2020 [3].
Publishers may try to manipulate our news originality signal. To this end, PageRank can be protected from abuse [14], whereas Facebook's integrity monitoring and enforcement [16] has a particular focus on coordinated inauthentic behaviors [12].
Table 6: User engagement lift in promoting original news
Originality threshold  Increase in num. views (%)
0.4                    15.36
0.5                    14.72
0.6                    14.30
0.7                    13.83
0.8                    13.38
In this paper, we introduce a strategy to prioritize original news in social networks. This strategy computes PageRank scores of news articles and estimates originality by normalizing PageRank scores for each news event. Equation 2 is a particularly novel contribution.
We deployed the originality signal to the personalized Facebook News Feed, which compiles articles from sources followed by the user and the user's friends [11, 13, 20, 23–25]. When multiple articles are available in a user's inventory, we promote the more original ones. While subtle, such changes influence what the community sees. As part of our work, we performed conceptual, qualitative and quantitative evaluation to confirm that our techniques positively impact the news ecosystem. In particular, the exposure of original content has grown, and users received more content they liked. Over a longer timeframe, these developments should encourage publishers to invest more in original content.
ACKNOWLEDGMENTS
We would like to thank Jon Levin, Gabriella Schwarz, Lucas Adams, David Vickrey, Xiaohong Zeng, Joe Isaacson, Gedaliah Friedenberg, Pengfei Wang, Feng Yan, Jerry Fu, Songbin Liu, Yan Qi, Ranjan Subramanian, Adrian Le Pera, Vasu Vadlamudi, Julia Smekalina, and others who supported and collaborated with us throughout.
REFERENCES
[1] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. J. Statistical Mechanics: Theory and Experiment.
[2] … Computer Networks.
[3] … Prioritizing Original News Reporting on Facebook. Facebook Newsroom. https://about.fb.com/news/2020/06/prioritizing-original-news-reporting-on-facebook/
[4] Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2015. Fame for sale: Efficient detection of fake Twitter followers. Decision Support Systems 80 (2015), 56–71.
[5] Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2017. The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. In Proc. WWW. ACM, Perth, Australia, 963–972.
[6] Gianmarco De Francisci Morales, Aristides Gionis, and Claudio Lucchese. 2012. From chatter to headlines: harnessing the real-time web for personalized news recommendation. In Proc. WSDM. ACM, Washington, USA, 153–162.
[7] Gianna M Del Corso, Antonio Gulli, and Francesco Romani. 2005. Ranking a stream of news. In Proc. WWW. ACM, Chiba, Japan, 97–106.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. 17th NAACL. ACL, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[9] Ye Du, Yaoyun Shi, and Xin Zhao. 2007. Using spam farm to boost PageRank. In Proc. Intl. Workshop on Adversarial Information Retrieval on the Web. ACM, Banff, Alberta, Canada, 29–36.
[10] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD, Vol. 96. AAAI Press, Portland, Oregon, USA, 226–231.
[11] Facebook. 2019. People, Publishers, the Community. Facebook Newsroom. https://about.fb.com/news/2019/04/people-publishers-the-community/
[12] Facebook. 2021. December 2020 Coordinated Inauthentic Behavior Report. Facebook Newsroom. https://about.fb.com/news/2021/01/december-2020-coordinated-inauthentic-behavior-report/
[13] Facebook. 2021. News content on Facebook.
[14] … 2016. Penguin is now part of our core algorithm. Google Search Central Blog. https://developers.google.com/search/blog/2016/09/penguin-is-now-part-of-our-core
[15] Robert Gwadera and Fabio Crestani. 2009. Mining and ranking streams of news stories using cross-stream sequential patterns. In Proc. ACM Conference on Information and Knowledge Management. ACM, Hong Kong, China, 1709–1712.
[16] Alon Halevy et al. 2020. Preserving Integrity in Online Social Networks. In Proc. KDD. ACM, USA, arXiv:2009.10311.
[17] Yang Hu, Mingjing Li, Zhiwei Li, and Wei-ying Ma. 2006. Discovering authoritative news sources and top news stories. In Asia Information Retrieval Symposium. Springer, Beijing, China, 230–243.
[18] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models.
[19] Nattiya Kanhabua, Roi Blanco, and Michael Matthews. 2011. Ranking related news predictions. In Proc. Intl. ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, Beijing, China, 755–764.
[20] Akos Lada, Meihong Wang, and Tak Yan. 2021. How machine learning powers Facebook's News Feed ranking algorithm. Facebook. https://engineering.fb.com/2021/01/26/ml-applications/news-feed-ranking/
[21] Jinhyuk Lee et al. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
[22] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A large-scale graph embedding system. Proc. ML Sys Conference.
[23] … 2018. Bringing People Closer Together. Facebook Newsroom. https://about.fb.com/news/2018/01/news-feed-fyi-bringing-people-closer-together/
[24] Adam Mosseri. 2018. News Feed Ranking in Three Minutes Flat. Facebook Inc. https://newsroom.fb.com/news/2018/05/inside-feed-news-feed-ranking/
[25] Xiuyan Ni et al. 2019. Feature Selection for Facebook Feed Ranking System via a Group-Sparsity-Regularized Training Algorithm. In Proc. 28th ACM Intl. CIKM. ACM, Beijing, China, 2085–2088.
[26] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank citation ranking: Bringing order to the Web. Technical Report. Stanford InfoLab.
[27] Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. Trans. ACL.
[28] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proc. of EMNLP. ACL, Hong Kong, China, 3982–3992. https://arxiv.org/abs/1908.10084
[29] J. Reis, Fabrício Benevenuto, Pedro OS de Melo, Raquel Prates, Haewoon Kwak, and Jisun An. 2015. Breaking the news: First impressions matter on online news.
[30] Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu. 2017. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans. on Database Systems (TODS) 42, 3 (2017), 1–21.
[31] Alexandru Tatar, Panayotis Antoniadis, Marcelo Dias De Amorim, and Serge Fdida. 2014. From popularity prediction to ranking online news. Social Network Analysis and Mining 4, 1 (2014), 174.
[32] Thomas Wolf et al. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proc. of EMNLP.
[33] … In Proc. KDD. ACM, 2227–2236. https://doi.org/10.1145/2783258.2788602
[34] Junting Ye and Steven Skiena. 2019. MediaRank: Computational ranking of online news sources. In Proc. KDD. ACM, Anchorage, AK, USA, 2469–2477.
[35] A. X. Zhang et al. 2018. A structured response to misinformation: Defining and annotating credibility indicators in news articles. In Proc. WWW. ACM, Lyon, France, 603–612.
[36] Guanjie Zheng et al. 2018. DRN: A deep reinforcement learning framework for news recommendation. In Proc. WWW. ACM, Lyon, France, 167–176.
[37] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books.