Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset
Shahan Ali Memon, Kathleen M. Carley
Carnegie Mellon University
Abstract
From conspiracy theories to fake cures and fake treatments, COVID-19 has become a hot-bed for the spread of misinformation online. It is more important than ever to identify methods to debunk and correct false information online. In this paper, we present a methodology and analyses to characterize the two competing COVID-19 misinformation communities online: (i) misinformed users, or users who are actively posting misinformation, and (ii) informed users, or users who are actively spreading true information or calling out misinformation. The goals of this study are two-fold: (i) collecting a diverse annotated COVID-19 Twitter dataset that can be used by the research community to conduct meaningful analysis; and (ii) characterizing the two target communities in terms of their network structure, linguistic patterns, and their membership in other communities. Our analyses show that COVID-19 misinformed communities are denser and more organized than informed communities, with a possibility that a high volume of the misinformation is part of disinformation campaigns. Our analyses also suggest that a large majority of misinformed users may be anti-vaxxers. Finally, our sociolinguistic analyses suggest that COVID-19 informed users tend to use more narratives than misinformed users.

Introduction

With the emergence of the COVID-19 pandemic, political and medical misinformation has elevated to create what is commonly being referred to as the global infodemic. False information has hampered proper communication and affected the decision-making process [BE+20]. It is therefore important to debunk and correct false information, and to change the beliefs of the misinformed communities. To be able to do this, it is important to understand how different communities interact, which sub-communities they belong to, and what their preferences are.
In this paper, we characterize the COVID-19 misinformation communities on Twitter in terms of their network structure, linguistic patterns, and membership in other misinformation and disinformation sub-communities. In the process, we also design and collect a large annotated dataset with a comprehensive codebook, which we make available for the community to use for further analysis and for models for misinformation detection.

Related Work

In a short amount of time, many COVID-19 datasets have been released. Most of these datasets are generic, and lack annotations or labels. Examples include multilingual corpora on a wide variety of topics related to COVID-19 [CLF20, AMEP+20, HJB+20], and datasets focused on stance in replies and quotes [VCKBC20]. Most of these datasets either have no annotations at all, employ automated annotations using transfer learning or semi-supervised methods, or are not specifically designed for misinformation.

In terms of datasets collected for COVID-19 misinformation analysis and detection, examples include CoAID [CL20], which contains automatic annotations for tweets, replies, and claims for fake news; ReCOVery [ZMFZ20], a multimodal dataset annotated for tweets sharing reliable versus unreliable news, annotated via distant supervision; FakeCovid [SN20], a multilingual cross-domain fake news detection dataset with manual annotations; and [DSW20], a large-scale Twitter dataset also focused on fake news. A survey of the different COVID-19 datasets can be found in [LUM+20] and [SAAA20].

In terms of the diversity of the classes and the size of the dataset, the most relevant dataset is by Alam et al. [ASN+20] who, like our study, present a comprehensive codebook to annotate tweets at a finer granularity. Their dataset, however, is limited to a few hundred tweets, and our dataset is much more diverse in the range of topics covered. Dharawat et al. [DLMZ20] present a similar dataset with a focus on the severity of the misinformation. However, their dataset does not consider the different "types" of misinformation. Finally, Song et al. present a dataset in [SPJ+20] which contains a diverse set of 10 categories, but is still not as large, and contains fewer categories in relation to the dataset collected within our study.
A plethora of research has already been conducted on analysing COVID-19 misinformation online. Examples include the categorization and identification of misinformed users based on their home countries, social identities, and political affiliation [HC20, SSM+20]. Unlike these studies, we do not aim to detect misinformation content directly. Instead, we conduct a set of analyses to understand and characterize the competing COVID-19 communities through their content, content-sharing behaviors, and interactions.

Data Collection

To collect our Twitter dataset, we use the Twitter search API with a diverse set of keywords, as shown in table 1. We collected our data on three days: 29th March 2020, 15th June 2020, and 24th June 2020. Each of these collections extracted a set of tweets from its corresponding week. For the annotation process, tweets were randomly sampled from that set.

Table 1: This table shows the hashtags and keywords we used in conjunction with "coronavirus" and "covid" to collect data from Twitter.
Type | Terms
Keywords | bleach, vaccine, acetic acid, steroids, essential oil, saltwater, ethanol, children, kids, garlic, alcohol, chlorine, sesame oil, conspiracy, 5G, cure, colloidal silver, dryer, bioweapon, cocaine, hydroxychloroquine, chloroquine, gates, immune, poison, fake, treat, doctor, senna makki, senna tea
Hashtags |
Annotation

Our annotation task aims to determine the category to which a given tweet belongs. After many discussions and revisions, we identified 17 categories to which a particular tweet could be assigned. These 17 categories are listed in table 2, and are defined in further detail, along with examples, in our codebook, which we make available for the public to use.

Based on these categories, tweets were randomly and uniformly sampled from the data collection to maintain diversity in terms of topics covered. In the first phase, around 4573 tweets were annotated by a single annotator. Table 2 shows the distribution of the data in terms of the different categories as annotated by the first annotator. In the second phase, 651 of these annotated tweets were assigned randomly to 6 other annotators.
Our data collection strategy differs from others in two main aspects: (i) we have a diverse set of categories taking into consideration different types of information and misinformation online; and (ii) our dataset is one of the very few, if not the only one, with emphasis on informed communities, with categories such as "True Prevention", "Calling out/correction", "True Public Health Response", and "Sarcasm". We believe this is necessary, as building models requires the annotation not just of false information, but also of complementary true information categories.

Table 2: This table describes the categories we identified to classify/annotate tweets, along with the distribution of annotations as identified by Annotator 1 in the first phase.

Category | Count
Irrelevant | 131
Conspiracy | 924
True Treatment | 0
True Prevention | 175
Fake Cure | 141
Fake Treatment | 34
False Fact or Prevention | 321
Correction/Calling out | 1331
Sarcasm/Satire | 476
True Public Health Response | 163
False Public Health Response | 3
Politics | 512
Ambiguous/Difficult to Classify | 143
Commercial Activity or Promotion | 37
Emergency Response | 17
News | 95
Panic Buying | 70

In the end, we have 4573 annotated tweets comprising 3629 users, with an average of 1.24 tweets per user. Our annotated data not only covers a wide range of categories, as observed in table 2, but also covers a wide range of topics, as can be seen in figure 1. We call this dataset CMU-MisCOV19 [MC20]. In adherence to the FAIR principles, the database and the codebook have been uploaded to Zenodo and are accessible at the following link: http://doi.org/10.5281/zenodo.4024154. In adherence to Twitter's terms and conditions, we do not provide the full tweet JSONs, but provide the tweet IDs so that the tweets can be rehydrated. We also provide the annotations and the date of creation for each tweet for the reproduction of the results of our analyses. The annotated tweets are included in a CSV file with the following fields: status id (tweet id of the tweet), status created at (timestamp of the creation of the tweet), annotation1 (annotated class of the tweet by the first annotator), and annotation2 (annotated class of the tweet by the second annotator, if it exists).
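As a rough illustration, a CSV with this schema can be parsed with the Python standard library. The exact header spellings (`status_id`, etc.) and the sample rows below are assumptions for demonstration only, not values from the released dataset:

```python
import csv
import io

# Hypothetical two-row sample mirroring the described CSV schema; the real file
# comes from the Zenodo record, with tweets rehydrated from the tweet IDs.
sample = """status_id,status_created_at,annotation1,annotation2
1244001,2020-03-29 10:00:00,Conspiracy,
1244002,2020-06-15 12:30:00,Calling out/correction,Sarcasm/Satire
"""

def load_annotations(fp):
    """Parse the annotation CSV into a list of dicts, mapping a missing
    second annotation to None."""
    rows = []
    for row in csv.DictReader(fp):
        if not row.get("annotation2"):
            row["annotation2"] = None
        rows.append(row)
    return rows

rows = load_annotations(io.StringIO(sample))
print(len(rows))                   # 2
print(rows[1]["annotation1"])      # Calling out/correction
```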
Community Membership

Conducting analyses for a competing set of communities requires identifying those communities first. Because we have already annotated data across a set of true and false information categories, we identify the membership of the users by assigning a valence of +1 to the categories True Treatment, True Prevention, Correction/Calling Out, Sarcasm/Satire, and True Public Health Response, and a valence of -1 to the categories Conspiracy, Fake Cure, Fake Treatment, False Fact or Prevention, and False Public Health Response. Note that we assign the valence to the categories (or annotations) and not to the tweets themselves. This is so that we can leverage the annotations from multiple annotators. In the end, we compute the valence of each user as a weighted sum of the valences of the annotations assigned to their tweets. We then use the valence assigned to each user to identify their membership: if the valence is greater than 0, the user is assigned to the informed group, and if the valence is less than 0, the user is assigned to the misinformed group. Out of 3629 users, the community detection process assigns 47% (1697) of the users to the informed group, 29% (1043) of the users to the misinformed group, and 24% (889) of the users to the ambiguous or irrelevant category.

Figure 1: This chart shows the frequency of each identified topic across all the tweets. Note: some tweets may have more than one topic.

Because our goal is to characterize communities and their behaviors, once we identify the two communities, we collect the timelines of users in each community to augment our data. Our hypothesis is that these additional posts can be used to mitigate survivorship bias [BGIR92] within our analyses. To conduct network analysis, bot analysis, and sociolinguistic analysis, we first extract only the COVID-19-related tweets from the timelines of each user. We do this by filtering all the tweets by the case-insensitive keywords "corona" and "covid". This yields a total of 330609 tweets, with an average of 91 tweets per user.
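The membership rule above can be sketched in a few lines. This is a minimal sketch assuming uniform annotation weights (the paper's exact weighting of multiple annotators is not specified here); the demo users and annotation pairs are invented for illustration:

```python
from collections import defaultdict

# Category valences as defined in the paper; all other categories carry no valence.
VALENCE = {
    "True Treatment": +1, "True Prevention": +1, "Correction/Calling out": +1,
    "Sarcasm/Satire": +1, "True Public Health Response": +1,
    "Conspiracy": -1, "Fake Cure": -1, "Fake Treatment": -1,
    "False Fact or Prevention": -1, "False Public Health Response": -1,
}

def user_membership(annotations):
    """annotations: iterable of (user_id, category) pairs, one per annotation.
    Returns user_id -> 'informed' / 'misinformed' / 'ambiguous' by the sign
    of the summed annotation valences."""
    valence = defaultdict(int)
    for user, category in annotations:
        valence[user] += VALENCE.get(category, 0)
    return {u: "informed" if v > 0 else "misinformed" if v < 0 else "ambiguous"
            for u, v in valence.items()}

demo = [("u1", "Conspiracy"), ("u1", "Sarcasm/Satire"), ("u1", "Fake Cure"),
        ("u2", "True Prevention"), ("u3", "Politics")]
print(user_membership(demo))
# {'u1': 'misinformed', 'u2': 'informed', 'u3': 'ambiguous'}
```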
Note that irrelevant users, i.e. users who have only posted tweets within other categories such as "Politics" or "Emergency", are not relevant for the purposes of this study, because these categories do not have an assigned valence related to misinformation.

Network Analysis

To conduct network analysis, we extract the retweet, mention, and reply networks of the two target communities, and combine those networks together. We then compute the network density for each of the two groups. As described in [MTMC20], network density is defined as the ratio of actual connections to potential connections. In dense networks, conformity of ideas is highly encouraged, and difference of opinion is discouraged. We also use ORA-PRO [CRC, ACR17, ACR18] to plot the network graph, as shown in figure 2.

Figure 2: Retweet+mention+reply network with informed users (in green) and misinformed users (in red), created using ORA-PRO [ACR18, Car17]. Note: users with unidentified or ambiguous membership have been removed from the graph for simplicity.

We note that both the informed and misinformed users display echo-chamberness, with misinformed sub-communities being much denser than the informed sub-communities, as shown in table 3. We do, however, notice some two-way communication from both sides.

Table 3: This table shows the number of nodes, links, and the network density for the two target sub-communities.
Measure | Overall | Informed | Misinformed
Nodes | 2477 | 1515 | 923
Links | 2947 | 1489 | 826
Network Density | 4.8e-4 | 6.5e-4 | 9.7e-4
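The density values can be reproduced directly from the node and link counts. A minimal sketch, assuming the combined network is treated as directed (so the number of potential links is n(n-1)):

```python
def network_density(n_nodes, n_links, directed=True):
    """Density = actual links / potential links; n*(n-1) for a directed graph,
    half that for an undirected one."""
    potential = n_nodes * (n_nodes - 1) if directed else n_nodes * (n_nodes - 1) / 2
    return n_links / potential

# Figures from Table 3
print(round(network_density(1515, 1489), 6))  # 0.000649 (informed)
print(round(network_density(923, 826), 6))    # 0.000971 (misinformed)
```

The outputs match the ~6.5e-4 and ~9.7e-4 reported in table 3.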
We also plot the retweet, mention, and reply networks separately, as shown in figure 3. While the retweet and mention networks show little to no two-way communication, we can observe that the reply network, while small in size, does in fact have much more inter-group engagement. We hypothesize that this is likely a consequence of the "corrective" or "calling-out" behavior.
Bot Analysis

To understand the role of bots within the two competing groups, we used Bot-Hunter [BC18b, BC18a, BCB+18, BC20], which has a precision of .957 and a recall of .704, to identify potential bot-like accounts. We use a probability greater than or equal to .75 as our confidence threshold to identify bots. We use a two-sample z-test for the difference of proportions (α = 0.05) to test the difference in the proportion of bots between the two competing groups of users. The results of our analyses can be found in table 4.

Figure 3: Retweet (left), mention (middle), and reply (right) networks with informed users (in green) and misinformed users (in red), created using ORA-PRO [ACR18, Car17].

Table 4: This table shows the number and percentage of bots within each of the two competing groups.
Measure | Overall | Informed | Misinformed
Number of Users | 3629 | 1697 | 1043
Number of Bots | 505 | 184 | 202
Percentage of Bots | 14% | 11% | 19%
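The two-sample z-test for a difference of proportions can be sketched with the standard pooled-variance formulation, applied to the bot counts in table 4:

```python
from math import sqrt, erf

def two_proportion_z(successes1, n1, successes2, n2):
    """Two-sample z-test for a difference of proportions (pooled standard error).
    Returns the z statistic and a two-sided p-value via the normal CDF."""
    p1, p2 = successes1 / n1, successes2 / n2
    p = (successes1 + successes2) / (n1 + n2)        # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_two_sided

# Bot counts from Table 4: informed 184/1697, misinformed 202/1043
z, p = two_proportion_z(184, 1697, 202, 1043)
print(z < 0 and p < 0.05)  # True: the misinformed group has a higher bot share
```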
We observe that, from a total of 3629 users, 14% (505) of the users are identified as bots. The percentage of bots within identified misinformed users, however, is much higher (19%) than within identified informed users (11%). We find our results to be statistically significant (p < 0.05).

Linguistic Analysis

To understand the linguistic differences between the two competing communities, we conduct a linguistic analysis based on the tweets of the two groups using the Linguistic Inquiry and Word Count (LIWC) program [PBJB15]. LIWC is a text analysis tool which looks at different lexical categories, each of which is psychologically meaningful. For a given text, LIWC calculates the percentage for each LIWC category. All of these categories are based on word counts.

We run the LIWC program on the timelines of all the members of each of the two competing groups. We only use tweets relevant to COVID-19. We also remove users identified as bots. Because some users may be more active than others, using the results of the program as is may introduce biases in our analyses. To account for those biases, we first normalize the percentages by the size of the data for each user. We use the mean of the normalized LIWC indices of the tweets of individual users for a given lexical category as our test statistic. We use an independent z-test for the difference in means to establish statistical significance at α = 0.05. Our analyses are summarized in table 5.

Table 5: This table shows the summary of our analyses across all the linguistic dimensions described above using LIWC. The first column shows the lexical category. The second and third columns show the test statistic (M) as the mean of the LIWC indices for the informed and misinformed communities respectively. The fourth and fifth columns display the z-score and p-value for the independent z-test for the difference in means.

Lexical Category | M (Informed) | M (Misinformed) | z-score (Z) | p-value (Z)
function | 33.90 | 29.32 | 7.25 | < .001

For this part, we focus on investigating three linguistic dimensions, each of which, along with its linguistic correlates, is described below.
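The per-category test can be sketched as an independent z-test for a difference of per-user means. This is a minimal sketch; the per-user index values below are invented toy numbers, not the paper's data:

```python
from math import sqrt

def diff_of_means_z(xs, ys):
    """Independent (large-sample) z-test for a difference of means, using
    sample variances with Bessel's correction."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    se = sqrt(vx / len(xs) + vy / len(ys))
    return (mx - my) / se

# Toy per-user "function word" indices, one value per user after normalizing
# each user's LIWC percentage by their tweet volume (values are made up).
informed = [34.1, 33.2, 35.0, 33.5, 33.7]
misinformed = [29.0, 30.1, 28.7, 29.8, 29.4]
print(diff_of_means_z(informed, misinformed) > 1.96)  # True in this toy sample
```

In practice one value per user is computed for each lexical category, and the test is repeated per category as in table 5.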
Narratives

Narratives play a central role in how individuals process information, communicate, and reason [Ves17]. We set out to test the differences in the usage of narratives or anecdotes between the two COVID-19 misinformation communities. The LIWC correlates for narrative discourse structure include high usage of function words, pronouns, the analytic summary dimension, and authenticity. High usage of function words and pronouns happens more often when expressing feelings and behaviors, which tends to happen frequently in narratives [Pen11]. Moreover, low analytical thinking also suggests narrative language [PBJB15]. Furthermore, authentic individuals tend to be more personal, humble, and vulnerable [PBJB15]. Therefore, we use all of these as proxies to identify variation in the use of narratives across communities.

In the past [MTMC20], it has also been suggested that misinformed communities (e.g., anti-vaxxers) tend to use many more pronouns, suggesting a highly narrative discourse structure. In this analysis, however, we find that informed users in the COVID-19 discourse use significantly more pronouns and function words, mention more family-related keywords, are less analytical, and are more authentic and honest in comparison to misinformed users. All of these suggest that informed users may use many more narratives than misinformed users. This is an interesting finding, as it presents a dichotomy between the different misinformation communities (e.g., anti-vaxxers and the COVID-19 misinformed community). In hindsight, this is also an intuitive result, as our informed group is obtained from corrective discourse where users present stories of family members or friends suffering from COVID-19 to call out conspiracies and false information.
Because the two communities still seem to have little two-way communication, this also suggests that the content and framing of the message (i.e. narratives) alone may not be enough, and perhaps there is a need to connect the two groups by identifying an effective medium of communication.
Emotional Tone

Tone describes how positive a given text is. According to the definition used by LIWC, the higher the tone index, the more positive the tone. Indices less than 50 typically suggest a more negative tone. While we do not see significant differences in the emotional tone of the competing groups, we find both communities to be highly negative.
Formality

Formality of language has often been considered one of the most important dimensions of stylistic variation [GMC+14].

Vaccination Stance

To understand the interplay between the different kinds of misinformation themes and communities, we identify the vaccination-related stance of the members of the misinformed sub-community. To do that, we first identify the subset of the misinformed community who have posted at least one tweet related to "vaccines" in the past. We then collect the user-to-hashtag co-occurrence network. We use the valence of the vaccination hashtags, obtained via the label-propagation-based method mentioned in the study in [MTMC20], to identify the stance of each member (pro vs. anti) based on the weighted sum of the valences of the hashtags. If the weighted sum is greater than 0, we identify the member as a pro-vaxxer, and if the weighted sum is less than 0, we identify the member as an anti-vaxxer. The distribution of pro- and anti-vaxxers within the COVID-19 misinformed group is shown in table 6.

Table 6: This table shows the number and percentage of pro- and anti-vaxxers within the misinformed group.
Measure | Value
Users w/ vaccine-related tweets | 2750 (out of 3629)
Misinformed users | 1027 (37%)
Anti-vaxxers | 423 (41%)
Pro-vaxxers | 224 (22%)
Ambiguous | 380 (37%)
Misinformed pro-vaxxer bots | 37 (17%)
Misinformed anti-vaxxer bots | 82 (22%)
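The stance rule can be sketched as a weighted sum over hashtag valences. This is purely illustrative, not the exact method of [MTMC20]: the hashtag names and their seed valences below are hypothetical, and in practice the valences would come from label propagation over the hashtag co-occurrence network:

```python
# Hypothetical hashtag valences (in practice, learned via label propagation
# from a small set of seed hashtags with known stance).
HASHTAG_VALENCE = {"#vaccineswork": +1, "#vaccinessavelives": +1,
                   "#vaccineinjury": -1, "#bigpharma": -1}

def stance(hashtag_counts):
    """hashtag_counts: dict of hashtag -> co-occurrence count for one user.
    The user's stance is the sign of the count-weighted valence sum."""
    s = sum(HASHTAG_VALENCE.get(h, 0) * c for h, c in hashtag_counts.items())
    return "pro-vaxxer" if s > 0 else "anti-vaxxer" if s < 0 else "ambiguous"

print(stance({"#vaccineswork": 1, "#bigpharma": 3}))  # anti-vaxxer
print(stance({"#vaccinessavelives": 2}))              # pro-vaxxer
```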
We observe that, of the 1027 COVID-19 misinformed users in our dataset, 41% of the members are identified as anti-vaxxers, whereas only 22% of the members are identified as pro-vaxxers. The difference between the proportions of the two communities is significantly high. We also identify the proportion of bots within each of the two groups: misinformed pro-vaxxers, and misinformed anti-vaxxers. As shown in table 6, 17% of the misinformed pro-vaxxers are bots, which is significantly lower than the proportion of bots within the misinformed anti-vaxxers. The first thing this suggests is that a big chunk of COVID-19 misinformation online may in fact be disinformation, and hence intentional. The existence of bots within both the informed and misinformed communities also suggests that much of the disinformation online may be an organized effort to amplify the COVID-19 debate to create discord in the communities, as seen in the past with Twitter bots and Russian trolls [BJQ+18].

Limitations

The first important limitation pertaining to our work is that most of our analyses are based on data that has been annotated by only one annotator. We try to mitigate this by having more than 1/7th of our annotations annotated by a second annotator, and by taking all those annotations into account while computing the membership for each user. Another limitation of our work is that all our analyses are correlational in nature, and do not depict causation. A limitation pertaining to our data collection strategy is that we collect our data across a period of three weeks, augment our data with the timelines of users, and update our list of hashtags to account for new themes. We then sample a subset of this data for the annotation process. Because of the way the data was collected, it cannot be used for assessing change over time. Moreover, while this ensures the diversity of misinformation-related topics and agents, it may limit our ability to estimate the actual extent to which the different types of stories are more or less present.
Another limitation, related to our bot analysis, is that we use a second-level inference from a trained model. We try to mitigate this by using labels with probability greater than or equal to .75 to ensure high-quality labels. Finally, unlike "vaccination"-related discourse, COVID-19 does not have a clear definition of the "stance" of the users. This is because there are many sub-topics associated with COVID-19, each of which could have its own stance. In this work, we categorize users based on misinformation. However, the relationship between misinformation and stance vis-a-vis issues is complex, and one that needs to be understood. In future work, we hope to explore this relationship to create a systematic way of characterizing communities both in terms of misinformation and the different stances of the users.
Conclusion

In this paper, we present a methodology to characterize the competing COVID-19 misinformation communities by comparing them in terms of their network structure, sociolinguistic variation, and membership in disinformation campaigns and in other health-related misinformation communities such as anti-vaxxers. We find that even though COVID-19 is a recent event, misinformation related to it has created a set of polarized communities with high echo-chamberness. Misinformed communities are observed to be denser than informed communities, which is in line with previous studies such as [MTMC20]. We find that bots exist in both the informed and misinformed groups, but the percentage of bots among misinformed users is significantly higher, suggesting the prevalence of disinformation campaigns. Our sociolinguistic analysis suggests that both target communities depict a negative emotional tone in their posts, with signals that informed users use many more narratives than misinformed users. Finally, we discover that many misinformed users may be anti-vaxxers. Our analyses suggest that misinformation communities are much more complex, as they are highly organized and tend to be highly analytical. Contrary to previous suggestions [SOC19], they may not be responsive to narrative correctives, and hence a "one size fits all" generic messaging intervention for debunking misinformation may not be a feasible solution. A successful intervention may require identifying and banning the disinformation campaigns. It may also be useful to identify the right medium of communication to connect the two groups. This can be achieved by identifying users in misinformed communities who are not rebroadcasting, or who have high betweenness centrality, to be messengers for disseminating factual information. It may also be useful to further understand the linguistic patterns and preferences of these communities to create effective content and framing of the messaging.
Acknowledgements

This work was partially supported by a fellowship from Carnegie Mellon University's Center for Machine Learning and Health to Shahan A. Memon. We thank David Beskow for access to his Bot-Hunter model for bot analysis. We also thank members of CMU's Center for Computational Analysis of Social and Organizational Systems (CASOS) for insightful comments and discussions related to the data codebook and its revisions.
References

[AAA20] Sarah Alqurashi, Ahmad Alhindi, and Eisa Alanazi. Large arabic twitter dataset on covid-19. arXiv preprint arXiv:2004.04315, 2020.
[ACR17] Neal Altman, Kathleen M Carley, and Jeffrey Reminga. Ora users guide 2017. Carnegie-Mellon Univ. Pittsburgh PA Inst of Software Research International, Tech. Rep., 2017.
[ACR18] Neal Altman, Kathleen M Carley, and Jeffrey Reminga. Ora users guide 2018. Carnegie-Mellon Univ. Pittsburgh PA Inst of Software Research International, Tech. Rep., 2018.
[AMEP+20] Muhammad Abdul-Mageed, AbdelRahim Elmadany, Dinesh Pabbi, Kunal Verma, and Rannie Lin. Mega-cov: A billion-scale dataset of 65 languages for covid-19. arXiv preprint arXiv:2005.06012, 2020.
[ASN+20] Firoj Alam, Shaden Shaar, Alex Nikolov, Hamdy Mubarak, Giovanni Da San Martino, Ahmed Abdelali, Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Kareem Darwish, et al. Fighting the covid-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society. arXiv preprint arXiv:2005.00033, 2020.
[BC18a] David M Beskow and Kathleen M Carley. Bot conversations are different: leveraging network metrics for bot detection in twitter. Pages 825–832. IEEE, 2018.
[BC18b] David M Beskow and Kathleen M Carley. Bot-hunter: a tiered approach to detecting & characterizing automated activity on twitter. In SBP-BRiMS: International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, 2018.
[BC20] David Beskow and Kathleen M Carley. Social Cybersecurity. Springer, 2020.
[BCB+18] David Beskow, Kathleen M Carley, Halil Bisgin, Ayaz Hyder, Chris Dancy, and Robert Thomson. Introducing bothunter: A tiered approach to detection and characterizing automated activity on twitter. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, 2018.
[BE+20] Darrin Baines, RJ Elliott, et al. Defining misinformation, disinformation and malinformation: An urgent need for clarity during the covid-19 infodemic. Discussion Papers, pages 20–06, 2020.
[BGIR92] Stephen J Brown, William Goetzmann, Roger G Ibbotson, and Stephen A Ross. Survivorship bias in performance studies. The Review of Financial Studies, 5(4):553–580, 1992.
[BJQ+18] David A Broniatowski, Amelia M Jamison, SiHua Qi, Lulwah AlKulaib, Tao Chen, Adrian Benton, Sandra C Quinn, and Mark Dredze. Weaponized health communication: Twitter bots and russian trolls amplify the vaccine debate. American Journal of Public Health, 108(10):1378–1384, 2018.
[BKF+20] David A Broniatowski, Daniel Kerchner, Fouzia Farooq, Xiaolei Huang, Amelia M Jamison, Mark Dredze, and Sandra Crouse Quinn. The covid-19 social media infodemic reflects uncertainty and state-sponsored propaganda. arXiv preprint arXiv:2007.09682, 2020.
[BSHN20] J Scott Brennen, Felix Simon, Philip N Howard, and Rasmus Kleis Nielsen. Types, sources, and claims of covid-19 misinformation. Reuters Institute, 7, 2020.
[BTW+20] Juan M Banda, Ramya Tekumalla, Guanyu Wang, Jingyuan Yu, Tuo Liu, Yuning Ding, and Gerardo Chowell. A large-scale covid-19 twitter chatter dataset for open scientific research: an international collaboration. arXiv preprint arXiv:2004.03688, 2020.
[Car17] Kathleen M Carley. Ora: A toolkit for dynamic network analysis and visualization, 2017.
[CJHJA17] Man-pui Sally Chan, Christopher R Jones, Kathleen Hall Jamieson, and Dolores Albarracín. Debunking: A meta-analysis of the psychological efficacy of messages countering misinformation. Psychological Science, 28(11):1531–1546, 2017.
[CL20] Limeng Cui and Dongwon Lee. Coaid: Covid-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885, 2020.
[CLF20] Emily Chen, Kristina Lerman, and Emilio Ferrara. Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set. JMIR Public Health and Surveillance, 6(2):e19273, 2020.
[CRC] L Richard Carley, Jeff Reminga, and Kathleen M Carley. Ora & netmapper.
[DLMZ20] Arkin R Dharawat, Ismini Lourentzou, Alex Morales, and ChengXiang Zhai. Drink bleach or do what now? covid-hera: A dataset for risk-informed health decision making in the presence of covid19 misinformation. 2020.
[DSW20] Enyan Dai, Yiwei Sun, and Suhang Wang. Ginger cannot cure cancer: Battling fake health news with a comprehensive data repository. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 853–862, 2020.
[Fer20] Emilio Ferrara. What types of covid-19 conspiracies are populated by twitter bots? First Monday, 2020.
[GMC+14] Arthur C Graesser, Danielle S McNamara, Zhiqang Cai, Mark Conley, Haiying Li, and James Pennebaker. Coh-metrix measures text characteristics at multiple levels of language and discourse. The Elementary School Journal, 115(2):210–229, 2014.
[HC20] Binxuan Huang and Kathleen M Carley. Disinformation and misinformation on twitter during the novel coronavirus outbreak. arXiv preprint arXiv:2006.04278, 2020.
[HHSE20] Fatima Haouari, Maram Hasanain, Reem Suwaileh, and Tamer Elsayed. Arcov-19: The first arabic covid-19 twitter dataset with propagation networks. arXiv preprint arXiv:2004.05861, 2020.
[HJB+20] Xiaolei Huang, Amelia Jamison, David Broniatowski, Sandra Quinn, and Mark Dredze. Coronavirus twitter data: A collection of covid-19 tweets with automated annotations, 2020.
[LUM+20] Siddique Latif, Muhammad Usman, Sanaullah Manzoor, Waleed Iqbal, Junaid Qadir, Gareth Tyson, Ignacio Castro, Adeel Razi, Maged N Kamel Boulos, Adrian Weller, et al. Leveraging data science to combat covid-19: A comprehensive review. 2020.
[MC20] Shahan Ali Memon and Kathleen M. Carley. Cmu-miscov19: A novel twitter dataset for characterizing covid-19 misinformation, Sep 2020.
[MTMC20] Shahan Ali Memon, Aman Tyagi, David R Mortensen, and Kathleen M Carley. Characterizing sociolinguistic variation in the competing vaccination communities. arXiv preprint arXiv:2006.04334, 2020.
[OPR20] Catherine Ordun, Sanjay Purushotham, and Edward Raff. Exploratory analysis of covid-19 tweets using topic modeling, umap, and digraphs. arXiv preprint arXiv:2005.03082, 2020.
[PBJB15] James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. The development and psychometric properties of liwc2015. Technical report, 2015.
[Pen11] James W Pennebaker. The secret life of pronouns. New Scientist, 211(2828):42–45, 2011.
[QIO20] Umair Qazi, Muhammad Imran, and Ferda Ofli. Geocov19: a dataset of hundreds of millions of multilingual covid-19 tweets with location information. SIGSPATIAL Special, 12(1):6–15, 2020.
[SAAA20] Junaid Shuja, Eisa Alanazi, Waleed Alasmary, and Abdulaziz Alashaikh. Covid-19 open source data sets: A comprehensive survey. medRxiv, 2020.
[SDM20] Gautam Kishore Shahi, Anne Dirkson, and Tim A Majchrzak. An exploratory study of covid-19 misinformation on twitter. arXiv preprint arXiv:2005.05710, 2020.
[SN20] Gautam Kishore Shahi and Durgesh Nandini. Fakecovid: a multilingual cross-domain fact check news dataset for covid-19. arXiv preprint arXiv:2006.11343, 2020.
[SOC19] Angeline Sangalang, Yotam Ophir, and Joseph N Cappella. The potential for narrative correctives to combat misinformation. Journal of Communication, 69(3):298–319, 2019.
[SPJ+20] Xingyi Song, Johann Petrak, Ye Jiang, Iknoor Singh, Diana Maynard, and Kalina Bontcheva. Classification aware neural topic model and its application on a new covid-19 disinformation corpus. arXiv preprint arXiv:2006.03354, 2020.
[SSM+20] Karishma Sharma, Sungyong Seo, Chuizheng Meng, Sirisha Rambhatla, and Yan Liu. Covid-19 on social media: Analyzing misinformation in twitter conversations. arXiv preprint arXiv:2003.12309, 2020.
[TLC15] Andy SL Tan, Chul-joo Lee, and Jiyoung Chae. Exposure to health (mis)information: Lagged effects on young adults' health behaviors and potential pathways. Journal of Communication, 65(4):674–698, 2015.
[VCKBC20] Ramon Villa-Cox, Sumeet Kumar, Matthew Babcock, and Kathleen M Carley. Stance in replies and quotes (srq): A new dataset for learning stance in twitter conversations. arXiv preprint arXiv:2006.00691, 2020.
[Ves17] Marcela Veselková. Narrative policy framework: Narratives as heuristics in the policy process. Human Affairs, 27(2):178, 2017.
[YTLM20] Kai-Cheng Yang, Christopher Torres-Lugo, and Filippo Menczer. Prevalence of low-credibility information on twitter during the covid-19 outbreak. arXiv preprint arXiv:2004.14484, 2020.
[ZMFZ20] Xinyi Zhou, Apurva Mulay, Emilio Ferrara, and Reza Zafarani. Recovery: A multimodal repository for covid-19 news credibility research. arXiv preprint arXiv:2006.05557, 2020.