NELA-GT-2020: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

Abstract

In this paper, we present an updated version of the NELA-GT-2019 dataset, entitled NELA-GT-2020. NELA-GT-2020 contains nearly 1.8M news articles from 519 sources collected between January 1st, 2020 and December 31st, 2020. Just as with NELA-GT-2018 and NELA-GT-2019, these sources come from a wide range of mainstream news sources and alternative news sources. Included in the dataset are source-level ground truth labels from Media Bias/Fact Check (MBFC) covering multiple dimensions of veracity. Additionally, new in the 2020 dataset are the Tweets embedded in the collected news articles, adding an extra layer of information to the data. The NELA-GT-2020 dataset can be found at https://doi.org/10.7910/DVN/CHMUYZ.



Maurício Gruppi*, Benjamin D. Horne† and Sibel Adalı*
Rensselaer Polytechnic Institute*, University of Tennessee Knoxville†

The study of news production behavior of both high veracity and low veracity outlets continues to be a salient research topic. One of the continued challenges in this interdisciplinary research area is having broadly labeled datasets over extended periods of time. For example, mixed-methods work on a variety of news-related research questions requires data surrounding specific events (An and Gower 2009; Horne et al. 2020), which are often difficult to get post-event. Another example is machine learning research, which requires large, diverse, labeled datasets to train and test models (Baly et al. 2018; Baly et al. 2020; Horne, Nørregaard, and Adali 2019b).

There have been numerous efforts to collect news article data to fill this need. These include the previous versions of the NELA dataset (Gruppi, Horne, and Adalı 2020; Nørregaard, Horne, and Adalı 2019; Horne, Khedr, and Adalı 2018), the FA-KES dataset (Salem et al. 2019), and the Golbeck et al. dataset (Golbeck et al. 2018). The vast majority of datasets focused on information veracity have been social media posts rather than news articles, including datasets such as the FakeNewsNet dataset (Shu et al. 2018) and the LIAR dataset (Wang 2017).

More recently, there has been a significant effort in creating COVID-19 information datasets, both for news articles and social media posts. Some of these datasets have veracity labels, others do not. For example, Shahi and Nandini (2020) created a news article dataset with fact-check labels, Patwa et al. (2020) manually annotated a set of social media posts and articles, and Memon and Carley (2020) created an annotated set of Tweets focused on COVID-19.

In this paper, we present the latest version of the NELA-GT datasets, NELA-GT-2020 (data at: https://doi.org/10.7910/DVN/CHMUYZ). NELA-GT-2020 fills both needs described above, as it contains labeled news sources that covered a broad set of events, including the COVID-19 pandemic and the 2020 U.S. Presidential Election. More precisely, we have collected from 519 sources between January 1st, 2020 and December 31st, 2020. Additionally, we include source-level veracity labels from Media Bias/Fact Check (MBFC) and embedded Tweets found in the news articles.

In this short paper, we describe the differences between the previous versions of the NELA-GT datasets and NELA-GT-2020. We also describe in detail the data collection method, ground truth collection method, embedded tweet collection, and publicly available data formats. Lastly, we provide some metadata and a discussion of potential use cases.

In addition to the updated time frame, there are several new additions to NELA-GT-2020:

1. More data: The NELA-GT-2020 dataset contains 1,779,127 news articles from 519 sources published in the year 2020, up from 261 sources in NELA-GT-2019. The sources added to the collection are mostly fringe, unreliable news outlets. Moreover, past versions of the NELA dataset focused on political news content. This time we have expanded our news collection to a broader range of topics by collecting health-related news articles as well as general news feeds provided by the sources.

2. Embedded Tweets: A new piece of information available in NELA-GT-2020 is the embedded tweets. These consist of tweets featured in news articles. We provide the tweet content, author, date, and URL. We collected 410,432 embedded tweets across all of the news articles.

3. Ground Truth change: In the 2020 dataset, we aggregated the ground truth information as provided by Media Bias/Fact Check (MBFC). Concretely, we assigned reliability labels to news sources based on MBFC's Factuality Score. The labels are one of unreliable, mixed, or reliable.

The data collection process follows what was described in (Nørregaard, Horne, and Adalı 2019). Specifically, we scraped the RSS feeds of each source in our source collection list twice a day starting on 01/01/2020 using the Python libraries feedparser and goose. Our list of sources to collect was carried over from (Nørregaard, Horne, and Adalı 2019), with an additional 258 sources added to this list. These additional sources are mostly conspiracy-driven and pseudoscience media sites from the Media Bias/Fact Check list (https://mediabiasfactcheck.com/). Just as in the 2018 and 2019 versions, these sources come from a variety of countries (or the country of origin is not known), but all articles are in English.

Despite having significantly improved the robustness of our collection in comparison to previous years, NELA-GT-2020 suffered from a data outage during a few weeks due to technical issues. Specifically, our collection is missing articles in weeks 13, 14, and 15, which correspond to a period from late March through early April. Using linear interpolation on the data, we estimate that approximately 15,000 articles across this period were missed; this accounts for 0.8% of the dataset. Figure 1 shows the collection activity in each week. The estimated missing data is shown as a gray shaded area.

Just as in NELA-GT-2019, the dataset has been released in two formats: (1) a SQLite database, (2) a JSON dictionary per news source. Details about the structure of each of these formats are below. We provide Python code to read both data formats at: https://github.com/MELALab/nela-gt.

The SQLite 3 database schema consists of two tables: newsdata and tweet. The newsdata table contains, in each row, data about an article. Column id is set as primary key to avoid duplicated entries in the database. We normalized source names by converting them to lower case and removing spaces, punctuation, and hyphens. For example, the source The New York Times appears as thenewyorktimes. Tables 1 and 2 give information about data columns.
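The schema and the source-name normalization described above can be sketched with Python's standard sqlite3 module. This is an illustrative reconstruction, not the authors' released code: the column names follow Tables 1 and 2 (assuming the multi-word names are underscore-joined, e.g. published_utc), the helper normalize_source is hypothetical, and every row value is a made-up placeholder.

```python
import sqlite3
import string


def normalize_source(name: str) -> str:
    """Lower-case a source name and drop spaces, punctuation, and hyphens."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in name.lower() if ch not in drop)


# In-memory stand-in for the released database file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE newsdata (
        id TEXT PRIMARY KEY,
        date TEXT,
        source TEXT,
        title TEXT,
        content TEXT,
        author TEXT,
        published TEXT,
        published_utc INTEGER,
        collection_utc INTEGER,
        url TEXT
    );
    CREATE TABLE tweet (
        id TEXT PRIMARY KEY,
        article_id TEXT REFERENCES newsdata(id),
        embedded_tweet TEXT
    );
    """
)

# One placeholder article row (all values are invented for illustration).
conn.execute(
    "INSERT INTO newsdata VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (
        "example-article-1",
        "2020-01-01",
        normalize_source("The New York Times"),
        "Example headline",
        "Example body text.",
        None,
        "Wed, 01 Jan 2020 00:00:00 GMT",
        1577836800,
        1577840400,
        "https://example.com/article-1",
    ),
)

# The primary key on id rejects a second row with the same identifier.
try:
    conn.execute(
        "INSERT INTO newsdata (id, source) VALUES (?, ?)",
        ("example-article-1", "thenewyorktimes"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute(
    "SELECT source, COUNT(*) FROM newsdata GROUP BY source"
).fetchall()
print(rows, duplicate_rejected)
```

Pointing sqlite3.connect at the downloaded database file instead of ":memory:" should allow the same SELECT to run against the real data, provided the released schema matches Tables 1 and 2.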

We also provide the dataset in JSON format. Specifically, each source has one JSON file containing the list of all of its articles. The fields follow the same structure as the database columns (Tables 1 and 2).

Just as in NELA-GT-2019 and NELA-GT-2018, we include multiple types of source-level labels. In NELA-GT-2020, we collect source-level labels from Media Bias/Fact Check (MBFC, https://mediabiasfactcheck.com/) that contain the following dimensions of veracity:

1. Media Bias/Fact Check factuality score: on a scale from 0 to 5 (low to high credibility).
2. Media Bias/Fact Check Conspiracy/Pseudoscience and questionable sources: low credibility if a source belongs to these categories.

Unlike previous versions of this dataset, we only include veracity labels from MBFC. We choose to only include these veracity labels for two reasons: 1. To our knowledge, MBFC is the most complete and most up-to-date set of source-level veracity labels that are openly available. 2. Other external rating services, such as NewsGuard, are not freely available. We encourage researchers to corroborate the ratings from MBFC with other journalistic services if they are available.
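As a rough illustration of how a 0 to 5 factuality score and the conspiracy/questionable flag could be aggregated into the three reliability labels, consider the sketch below. The text does not state the cut-offs used for NELA-GT-2020, so the thresholds low_max and mixed_max, the function name reliability_label, and the handling of flagged sources are all assumptions made for this example.

```python
from typing import Optional


def reliability_label(
    factuality: Optional[int],
    flagged: bool = False,
    low_max: int = 1,    # assumed cut-off, not taken from the paper
    mixed_max: int = 3,  # assumed cut-off, not taken from the paper
) -> str:
    """Map an MBFC-style factuality score to a coarse reliability label.

    factuality: score on the 0 (low) to 5 (high) scale, or None if unrated.
    flagged: True if the source is listed as conspiracy/pseudoscience
             or as a questionable source.
    """
    if flagged:
        # Flagged sources are treated as low credibility regardless of score.
        return "unreliable"
    if factuality is None:
        return "unlabeled"
    if not 0 <= factuality <= 5:
        raise ValueError("factuality must be between 0 and 5")
    if factuality <= low_max:
        return "unreliable"
    if factuality <= mixed_max:
        return "mixed"
    return "reliable"


for score in (0, 2, 5, None):
    print(score, reliability_label(score))
```

Keeping the cut-offs as parameters makes it straightforward to check how sensitive a downstream analysis is to the choice of thresholds.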

One salient use case for this dataset is the study of COVID-19 media coverage. We provide a subset of the database that includes only COVID-19 related articles. This subset was generated via a simple keyword search on article title and body text. If an article had one or more keywords from our set featured in the title or body text, it was included in the COVID-19 subset. Figure 3a shows the number of news stories related to COVID-19 compared to all articles collected in each week of 2020.
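A keyword filter of the kind described above might look like the following sketch. The keyword list is hypothetical (the actual set used for the released subset is not given here), matching is assumed to be case-insensitive substring search, and the sample articles are invented.

```python
import re

# Hypothetical keyword set; the list used for the released subset may differ.
COVID_KEYWORDS = ["covid", "coronavirus"]


def build_matcher(keywords):
    """Compile one case-insensitive pattern that matches any keyword."""
    pattern = "|".join(re.escape(k) for k in keywords)
    return re.compile(pattern, re.IGNORECASE)


def in_subset(article, matcher):
    """True if any keyword occurs in the article title or body text."""
    title = article.get("title") or ""
    content = article.get("content") or ""
    return matcher.search(f"{title} {content}") is not None


# Invented sample rows using the field names from Table 1.
articles = [
    {"id": "a1", "title": "Coronavirus cases rise", "content": "..."},
    {"id": "a2", "title": "Local sports roundup", "content": "Scores."},
    {"id": "a3", "title": "Markets react", "content": "COVID-19 fears grow."},
]

matcher = build_matcher(COVID_KEYWORDS)
covid_subset = [a for a in articles if in_subset(a, matcher)]
print([a["id"] for a in covid_subset])
```

Substring matching is deliberately permissive; a stricter variant could anchor keywords on word boundaries at the cost of missing compound forms.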

Another salient use case for this dataset is the study of narratives and coverage surrounding the 2020 U.S. Presidential Election. Using another simple keyword search on the article title and body text, we provide a subset of the database that only includes 2020 U.S. Election related articles. We observed that a total of 294,504 articles contained at least one election-related term, across 403 sources. Figure 3b shows the distribution of election-related articles in each week of 2020.
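The weekly view of a topical subset relative to all collected articles can be approximated as below. This sketch assumes the date field holds YYYY-MM-DD strings (as in Table 1) and uses ISO week numbers, which may not coincide with the week numbering used in the figures; the function weekly_share, the matching rule, and the sample rows are illustrative only.

```python
from collections import Counter
from datetime import date


def weekly_share(articles, is_match):
    """Return {iso_week: fraction of that week's articles that match}."""
    totals = Counter()
    matches = Counter()
    for article in articles:
        week = date.fromisoformat(article["date"]).isocalendar()[1]
        totals[week] += 1
        if is_match(article):
            matches[week] += 1
    return {week: matches[week] / totals[week] for week in sorted(totals)}


# Invented sample rows; a single hypothetical keyword stands in for the
# election-related term list.
articles = [
    {"date": "2020-11-02", "title": "Election day nears"},
    {"date": "2020-11-03", "title": "Local weather report"},
    {"date": "2020-11-10", "title": "Election results certified"},
]

share = weekly_share(articles, lambda a: "election" in a["title"].lower())
print(share)
```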

Embedded tweets are tweets that news publishers decided to incorporate in their articles. A tweet may be the motivation of a news story, or used as evidence in the story, or may be itself the topic of the news story. We collected embedded tweets on the article page using the Goose3 library (https://github.com/grangier/python-goose). The raw HTML code of the embedded tweet is stored in the database table tweet, along with the id of the article from which it was collected.

In total, our dataset contains over 400,000 embedded tweets collected from the news articles. A single article may contain multiple embedded tweets, and a single tweet may be embedded in multiple articles. This may be used to construct a structured network of articles and tweets, which might prove, along with other signals, to be helpful in identifying signals of source reliability (Horne, Nørregaard, and Adalı 2019a; Rozemberczki, Allen, and Sarkar 2019; Gruppi, Horne, and Adalı 2021; Starbird et al. 2018).

Figure 1: Number of articles (a, c) and embedded tweets (b, d) collected during each week of 2020. Panels: (a) number of articles per reliability class; (b) number of embedded tweets per reliability class; (c) number of articles per MBFC factuality score; (d) number of embedded tweets per MBFC factuality score. Reliability classes shown: reliable, mixed, unreliable, unlabeled. Factuality scores shown: very-low, low, mixed, high, very-high, unlabeled. Missing data is shaded.

Figure 2: Distribution of sources per reliability class (a) and factuality score (b). Panels: (a) number of sources per reliability class; (b) number of sources per MBFC factuality score.

Table 1: Structure of NELA-GT-2020 article data. For the database format, column id is the primary key of table newsdata.

| Column | Type | Description |
|---|---|---|
| id | text (primary key) | Article identifier. |
| date | text | Publication date string in YYYY-MM-DD format. |
| source | text | Name of the source from which the article was collected. |
| title | text | Headline of the article. |
| content | text | Body text of the article. |
| author | text | Author of the article (if available). |
| published | text | Publication date time string as provided by source (inconsistent formatting). |
| published_utc | integer | Publication time as unix time stamp. |
| collection_utc | integer | Collection time as unix time stamp. |
| url | text | URL of the article. |

Table 2: Structure of NELA-GT-2020 embedded tweets. For the database format, column id is the primary key of table tweet.

| Column | Type | Description |
|---|---|---|
| id | text (primary key) | Tweet id. |
| article_id | text (foreign key) | Id of the article in which the embedded tweet was observed. |
| embedded_tweet | text | Raw HTML of the embedded tweet. |

Just as discussed in (Gruppi, Horne, and Adalı 2020), one of our goals with the continued release of the NELA-GT datasets is to support long-term news research. When combining all of the NELA datasets, we provide over 3.5 years of news data. There are multiple research avenues that this data, both in part and as a whole, supports:

• Robust machine learning: A significant amount of work has been done in automated news veracity detection. This dataset allows for continued work in this area, particularly in robustness checks of current work. These robustness checks include examining prediction accuracy over time, over events, and over mixed veracity labels. Additionally, a wide variety of methods outside of supervised machine learning can be tested on this dataset, such as semi-supervised news veracity detection and unsupervised news veracity detection.

• Exploring event-driven dynamics of and narratives in news media: Quantitative and qualitative analyses of narrative themes before, during, and after major events continue to be a useful methodology in interdisciplinary media studies. This dataset supports these works by maintaining consistent data collection across events.

• Examining media manipulation: Using the veracity labels in this dataset, researchers can examine tactics used by purposely false news outlets. Additionally, with knowledge of media manipulation campaigns, such as those outlined in the Media Manipulation Casebook, researchers can examine how media manipulation is spread through malicious news outlets. While there has been a substantial focus on "fake news" detection methods by researchers, there continues to be room for understanding and characterizing media manipulation and disinformation campaign tactics.

With several highly controversial events, such as the many rumors about the COVID-19 pandemic and the U.S. presidential elections, 2020 was a year with plenty of opportunities for better understanding misinformation and for the development of misinformation fighting tools. We presented NELA-GT-2020, a large-scale dataset of news articles in English with source-level reliability labels. This dataset features several improvements on its predecessors NELA-GT-2019 and NELA-GT-2018. First, it utilizes a more robust scraper, which is less susceptible to failures and sporadic data outages. The collection activity shows a much smoother trajectory, as seen in Figure 1, with no dips apart from the reported data outage in late March/early April. Furthermore, NELA-GT-2020 introduces a novel feature to the collection: embedded tweets. We believe that this additional piece of information may be useful in modeling relationships between news articles and/or sources, as well as better understanding the relationship between news media and social media. The dataset is available in SQLite and JSON formats; minimal working examples can be found in our code repository.
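To make the article and tweet relationships concrete, the sketch below builds the two-way mapping between articles and embedded tweets and derives a simple source-to-source co-embedding count from it. It is only one possible way to model these relationships: tweets are keyed by URL, the input pairs and source names are invented, and extracting tweet URLs from the stored raw HTML is left out.

```python
from collections import defaultdict
from itertools import combinations

# Invented (tweet_url, article_id) pairs, as might be derived from the
# tweet table once a tweet URL is extracted from the raw HTML.
embeds = [
    ("https://twitter.com/example/status/1", "a1"),
    ("https://twitter.com/example/status/1", "a2"),
    ("https://twitter.com/example/status/2", "a1"),
    ("https://twitter.com/example/status/3", "a2"),
    ("https://twitter.com/example/status/3", "a3"),
]

# Invented article -> normalized source name lookup.
article_source = {"a1": "alpha", "a2": "beta", "a3": "alpha"}

# Bipartite structure: each tweet maps to the articles that embed it,
# and each article maps to the tweets it embeds.
tweet_to_articles = defaultdict(set)
article_to_tweets = defaultdict(set)
for tweet_url, article_id in embeds:
    tweet_to_articles[tweet_url].add(article_id)
    article_to_tweets[article_id].add(tweet_url)

# Project onto sources: count tweets embedded by both sources in a pair.
shared_tweets = defaultdict(int)
for tweet_url, article_ids in tweet_to_articles.items():
    sources = sorted({article_source[a] for a in article_ids})
    for first, second in combinations(sources, 2):
        shared_tweets[(first, second)] += 1

print(dict(shared_tweets))
```

The resulting pair counts can serve as edge weights in a source network, to be combined with other signals rather than used alone.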

Figure 3: Number of articles related to (a) COVID-19, and number of articles related to (b) the U.S. Presidential Election as a fraction of the total number of articles in each week of 2020. Panels: (a) number of COVID-19 articles in comparison to all articles collected in NELA-GT-2020 in each week of 2020; (b) number of articles related to the 2020 U.S. Presidential Election in comparison to all articles collected in NELA-GT-2020 in each week of 2020. Missing data is marked.

Figure 4: Relationship between news articles and embedded tweets. A single embedded tweet may be referred to by multiple articles and an article may contain multiple embedded tweets. Embedded tweets are identified by their URL.

Media Manipulation Casebook: https://mediamanipulation.org/

References

[An and Gower 2009] An, S.-K., and Gower, K. K. 2009. How do the news media frame crises? A content analysis of crisis news coverage. Public Relations Review.

[Baly et al. 2018] [...] arXiv preprint arXiv:1810.01765.

[Baly et al. 2020] Baly, R.; Martino, G. D. S.; Glass, J.; and Nakov, P. 2020. We can detect your bias: Predicting the political ideology of news articles. arXiv preprint arXiv:2010.05338.

[Golbeck et al. 2018] Golbeck, J.; Mauriello, M.; Auxier, B.; Bhanushali, K. H.; Bonk, C.; Bouzaghrane, M. A.; Buntain, C.; Chanduka, R.; Cheakalos, P.; Everett, J. B.; et al. 2018. Fake news vs satire: A dataset and analysis. In WebSci, 17–21.

[Gruppi, Horne, and Adalı 2020] Gruppi, M.; Horne, B. D.; and Adalı, S. 2020. Nela-gt-2019: A large multi-labelled news dataset for the study of misinformation in news articles. arXiv preprint arXiv:2003.08444.

[Gruppi, Horne, and Adalı 2021] Gruppi, M.; Horne, B. D.; and Adalı, S. 2021. Tell me who your friends are: Using content sharing behavior for news source veracity detection. arXiv preprint arXiv:2101.10973.

[Horne et al. 2020] Horne, B. D.; Nevo, D.; Adali, S.; Manikonda, L.; and Arrington, C. 2020. Tailoring heuristics and timing AI interventions for supporting news veracity assessments. Computers in Human Behavior Reports.

[Horne, Khedr, and Adalı 2018] [...] In ICWSM, volume 12, 518–527. AAAI.

[Horne, Nørregaard, and Adalı 2019a] Horne, B. D.; Nørregaard, J.; and Adalı, S. 2019a. Different spirals of sameness: A study of content sharing in mainstream and alternative media. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, 257–266.

[Horne, Nørregaard, and Adali 2019b] Horne, B. D.; Nørregaard, J.; and Adali, S. 2019b. Robust fake news detection over time and attack. ACM Transactions on Intelligent Systems and Technology (TIST).

[Memon and Carley 2020] [...] arXiv preprint arXiv:2008.00791.

[Nørregaard, Horne, and Adalı 2019] Nørregaard, J.; Horne, B. D.; and Adalı, S. 2019. Nela-gt-2018: A large multi-labelled news dataset for the study of misinformation in news articles. In ICWSM, volume 13, 630–638. AAAI.

[Patwa et al. 2020] Patwa, P.; Sharma, S.; PYKL, S.; Guptha, V.; Kumari, G.; Akhtar, M. S.; Ekbal, A.; Das, A.; and Chakraborty, T. 2020. Fighting an infodemic: Covid-19 fake news dataset. arXiv preprint arXiv:2011.03327.

[Rozemberczki, Allen, and Sarkar 2019] Rozemberczki, B.; Allen, C.; and Sarkar, R. 2019. Multi-scale attributed node embedding. arXiv preprint arXiv:1909.13021.

[Salem et al. 2019] Salem, F. K. A.; Al Feel, R.; Elbassuoni, S.; Jaber, M.; and Farah, M. 2019. Fa-kes: A fake news dataset around the Syrian war. In ICWSM, volume 13, 573–582.

[Shahi and Nandini 2020] Shahi, G. K., and Nandini, D. 2020. Fakecovid–a multilingual cross-domain factcheck news dataset for covid-19. arXiv preprint arXiv:2006.11343.

[Shu et al. 2018] Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; and Liu, H. 2018. Fakenewsnet: A data repository with news content, social context and dynamic information for studying fake news on social media. arXiv preprint arXiv:1809.01286.

[Starbird et al. 2018] Starbird, K.; Arif, A.; Wilson, T.; Van Koevering, K.; Yefimova, K.; and Scarnecchia, D. 2018. Ecosystem or echo-system? Exploring content sharing across alternative media domains.

[Wang 2017] Wang, W. Y. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.