Malicious and Low Credibility URLs on Twitter during the AstraZeneca COVID-19 Vaccine Development
Sameera Horawalavithana, Ravindu De Silva, Mohamed Nabeel, Charitha Elvitigala, Primal Wijesekara, Adriana Iamnitchi
Sameera Horawalavithana
CSE, University of South
Ravindu De Silva
SCoRe
Mohamed Nabeel
Qatar Computing Research
Charitha Elvitigala
SCoRe
Primal Wijesekara
University of California
Adriana Iamnitchi
CSE, University of South
This study provides an in-depth analysis of a Twitter dataset around AstraZeneca COVID vaccine development released as part of the Grand Challenge, North American Social Network Conference, 2021. In brief, we show:
(1) the presence of malicious and low credibility information sources shared in Twitter messages in multiple languages;
(2) that malicious URLs, often in shortened form, are increasingly hosted in content delivery networks and shared cloud hosting infrastructures, not only to improve reach but also to avoid being detected and blocked;
(3) potential signs of coordination to promote both malicious and low credibility URLs on Twitter. We use a null model and several statistical tests to identify meaningful coordination behavior.

Our analysis identifies the patterns related to URLs and web domains cited in Twitter messages. We use multiple features related to URL mentions, including user sharing activities, the credibility of the information source, the maliciousness of the web domain, and the article content, to guide our analysis.

To understand the patterns of sharing URL information in Twitter conversations, we only consider the Twitter messages that contain at least one URL, which make up 25% of the total messages. As shown in Figure 1, they were shared between August 17 and September 10, 2020. We filtered out URLs to Twitter itself, which typically refer to other tweets. In addition, the external links (e.g., a tweet mentioning a YouTube video or an external website domain) mentioned in the messages are pre-processed as follows. Shortened URLs are expanded, and HTML parameters are removed from the URLs. YouTube URLs are resolved to the base URL if they include a parameter referencing a specific time in the video. We represent the URL by the parent domain when multiple child domains exist (e.g., fr.sputniknews.com, arabic.sputniknews.com, etc. are renamed as sputniknews.com). This pre-processing code for resolving URLs is publicly available [14].
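The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the published socialsim code [14]: the function names, the set of YouTube time parameters, and the "last two labels" rule for the parent domain are all assumptions made for the example. The two-label rule is deliberately naive and would mishandle suffixes such as .com.tr, and expanding shortened URLs is omitted because it requires network access.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of query parameters that reference a time in a video.
YOUTUBE_TIME_PARAMS = {"t", "start", "time_continue"}


def parent_domain(host: str) -> str:
    """Collapse child domains to the last two labels (naive heuristic)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def normalize_url(url: str) -> str:
    """Drop the fragment and query string; for YouTube, drop only time parameters."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    query = ""
    if parent_domain(host) == "youtube.com":
        kept = [(key, value) for key, value in parse_qsl(parts.query)
                if key not in YOUTUBE_TIME_PARAMS]
        query = urlencode(kept)
    return urlunsplit((parts.scheme, host, parts.path, query, ""))


def is_twitter_link(url: str) -> bool:
    """True for links that point back to Twitter itself."""
    return parent_domain(urlsplit(url).hostname or "") == "twitter.com"


if __name__ == "__main__":
    print(normalize_url("https://www.youtube.com/watch?v=abc123&t=42s"))
    print(parent_domain("fr.sputniknews.com"))
```

In practice a Public Suffix List lookup would be a safer way to find the parent domain than counting labels.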
In this analysis, we do not differentiate among message types (e.g., tweets, retweets, replies and quotes), but consider all messages as URL mentions.

For additional analysis, we scrape the content of the web page that each URL points to using the Python library Newspaper. We managed to scrape 22,856 articles. The tool failed to extract content from some web domains, mainly due to inactive web pages and restrictions enforced by the web domain.

We first show the basic characteristics of the URLs shared on Twitter (Section 3.1). Second, we report the characteristics of low credibility URLs (Section 3.2) and malicious URLs (Section 3.3). Finally, we show coordination patterns within the groups of users who attempt to promote both low credibility and malicious content (Section 3.4).
The main characteristics of the dataset are listed in Table 1. The resulting dataset consists of 85,064 Twitter messages.

Figure 1 shows a spike in URL mentions on September 9, 2020. This is mainly due to the reports of the halt in AstraZeneca vaccine development following a suspected adverse reaction in a trial participant. There are 23,620 (28%) messages citing 8,566 distinct URLs on this particular day. The most popular URLs point to mainstream news articles, and the article published on statnews.com received the highest number of shares (1,158).

According to langid.py [10], the majority of the articles are in English (6,453) and Spanish (5,197). Other articles are written in Turkish (1,839), French (1,539), Portuguese (1,277), Russian (784), Italian (772), Greek (708), Japanese (702), Croatian (447), etc. There are 274 and 165 YouTube videos shared on Twitter whose titles are written in English and Spanish, respectively.

Table 1: Basic characteristics of Twitter URL Dataset.
In this section, we report how Twitter users react to low credibility information sources. We group the web domains according to the classification made by Media Bias/Fact Check (MBFC) [2]. We also cross-check these domains with the list of web sites that publish false COVID-19 information as identified by NewsGuard [3].

Figure 1: Number of Twitter URL mentions over time. There is a known gap in the data collection between August 27-31.
We identify 6,860 (26%) URLs from 17 low credibility information sources that are shared in 26,220 (31%) messages (as shown in Table 2). sputniknews.com is the most popular web domain by the number of mentions (22,251), the number of users (4,358) and the number of URLs (6,638). This is expected, mainly because sputniknews.com is a network of media outlets that publish articles in many languages. For example, 1,377 and 1,080 sputniknews.com articles are published in Turkish and French, respectively. However, sputniknews.com URLs are not the most widely shared in the hours immediately after their first appearance (as shown in Figure 2b). For example, the median number of mentions of a sputniknews.com URL is 2 per hour. sputniknews.com is frequently described as a Russian propaganda outlet that spreads master narratives in Russia's disinformation campaign [2].

We noticed that rt.com acquires its many mentions from very few URLs: 119 URLs are cited in 3,225 messages during our observation period (Figure 2a). rt.com URLs have the highest number of mentions in the first hour after publication (as shown in Figure 2b) relative to the rest of the popular web sites.

We also note that Twitter messages repeat the same article heading when they cite the same URL. These users might promote certain topics through massive repetition of messages that inject URLs. For example, an article published on zerohedge.com was among the Top-10 most popular URLs on the day when AstraZeneca vaccine development halted. However, this article tried to build an alternative frame by highlighting a statement by the US House Speaker Nancy Pelosi about the issue instead of reporting the details of the main event. This technique can spoof the Twitter trending algorithm to broadcast their message to a wider audience and to establish their preferred narrative via artificial amplification.

We also discovered URLs from two Russian news media domains, sputnik-abkhazia.ru and moskva-tyt.ru, that are shared by very few users. For example, sputnik-abkhazia.ru is the fourth most popular web domain by the number of URLs cited in Twitter messages. These URLs are shared solely by the Twitter account controlled by the news media site, and they gain limited engagement from other users.

Table 2: Twitter sharing characteristics for low credibility URLs as identified by MBFC and NewsGuard (NG). We use the ✓ mark to reflect whether the domain is listed as a low credibility information source by the respective fact checking organization.

Domain MBFC NG
✓ ✓ ✓ ✓ ✓ - 117 6 62
zerohedge.com ✓ ✓ 278 249 15
swarajyamag.com ✓ - 228 105 5
oann.com ✓ ✓ 50 48 4
childrenshealthdefense.org ✓ ✓ ✓ ✓ 16 7 3
torontotoday.net ✓ - 9 5 2
gnews.org ✓ ✓ ✓ ✓ 29 29 1
needtoknow.news ✓ - 5 5 1
oye.news ✓ ✓ ✓ ✓ ✓ - 1 1 1
wakingtimes.com ✓ ✓ ✓ ✓

Another objective is to compare and contrast the linguistic patterns used in misleading and credible news articles. We used the articles hosted at the two most popular domains, sputniknews.com and reuters.com, for this analysis. While sputniknews.com is known to publish misleading, questionable content [2, 3], reuters.com is a credible mainstream news media outlet. To analyze the linguistic patterns of the words used in the articles, we use LIWC to count the words associated with different psychological categories. LIWC [18] is a linguistic dictionary that contains 4,500 words categorized to analyze psycholinguistic patterns in text. We use six categories in the LIWC dictionary: Social processes, Affective processes, Cognitive processes, Perceptual processes, Biological processes and Relativity. Figure 3 shows the fraction of words used in the articles for each LIWC category.

It is interesting to note that sputniknews.com articles use fewer words related to biological processes, which cover the main discussion topics, than reuters.com articles do (Figure 3). We noticed that sputniknews.com articles used more words associated with social processes (e.g., friends, family), affective processes (e.g., anxiety, sadness), and cognitive processes (e.g., discrepancy, tentative) than the articles published by reuters.com. This suggests that the sputniknews.com articles used a different writing style.

We use VirusTotal (VT) [19] to assess the maliciousness of URLs. VT provides state-of-the-art aggregated intelligence for domains and URLs, and relies on more than 70 regularly updated third-party antivirus (AV) engines. For all distinct URLs in our collection, we extract VT scan reports by querying the publicly available API. Each VT scan report contains the verdict from every AV engine and information related to the URL, such as the first and last seen dates of the URL in the VT system, hosting IP address, final redirected URL (if applicable), content length, etc.
Each AV engine in a VT report indicates whether the URL is malicious or not. We use the number of engines that detect a URL as malicious as an indication of the maliciousness of the URL.

Figure 2: Twitter sharing characteristics of the most popular domains. (a) Top-10 shared domains: sputniknews.com, youtube.com, reuters.com, sputnik-abkhazia.ru, newsfilter.io, instagram.com, nytimes.com, rt.com, moskva-tyt.ru, statnews.com (y-axis: mentions). (b) Twitter lifespan of URLs by domain (x-axis: time progression in hours; y-axis: median URL mentions). Figure a) shows the Top-10 domains by the number of distinct URLs. The size of the markers in this plot is proportional to the number of URLs associated with the domain. Figure b) shows the number of URL mentions posted in each hour after the first appearance of the URL in the respective domain. We count the number of mentions for each URL in the respective hour, and calculate the median number of mentions for the domain of the URL.

Table 3: The most popular domains hosting articles written in multiple languages. sputniknews.com and youtube.com are among the most popular domains.

English: sputniknews.com, youtube.com, reuters.com, newsfilter.io, nytimes.com
Spanish: sputniknews.com, youtube.com, rt.com, reuters.com, wp.me
Turkish: sputniknews.com, youtube.com, ntv.com.tr, is.gd, sozcu.com.tr
French: sputniknews.com, cvitrolles.wordpress.com, youtube.com, francetvinfo.fr, lalibre.be
Portuguese: sputniknews.com, brasil247.com, evsarteblog.wordpress.com, youtube.com, tvi24.iol.pt

Figure 3: Lexical patterns used in the articles hosted at sputniknews.com and reuters.com (x-axis: psychological processes (LIWC): Social, Affective, Cognitive, Perceptual, Biological, Relativity; y-axis: fraction of words in the news article).
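The per-category word fractions plotted in Figure 3 can be illustrated with a small dictionary-lookup sketch. The lexicon below is a toy stand-in, not the LIWC dictionary: the social and affective entries reuse the example words quoted in the text, while the biological entries and the tokenization rule are assumptions made for illustration.

```python
import re
from collections import Counter

# Toy lexicon for illustration only; NOT the actual LIWC dictionary.
TOY_LEXICON = {
    "social": {"friends", "family"},
    "affective": {"anxiety", "sadness"},
    "biological": {"vaccine", "dose", "blood"},  # hypothetical entries
}


def category_fractions(text: str, lexicon: dict) -> dict:
    """Return, per category, the fraction of tokens in `text` found in that category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {category: 0.0 for category in lexicon}
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return {category: counts[category] / len(tokens) for category in lexicon}


if __name__ == "__main__":
    sample = "The vaccine dose caused anxiety among family members"
    print(category_fractions(sample, TOY_LEXICON))
```

Averaging these fractions over all articles from one domain gives one bar per category, which is the kind of comparison Figure 3 draws between the two outlets.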
In this study, we label a URL as malicious if at least one AV engine (that is, ≥ 1) detects it as malicious. We identify 441 malicious URLs from the collected VT reports. We observe that 25.75% of the malicious URLs utilize URL shortening services, the top 4 services being bit.ly, tinyurl.com, ow.ly and goo.su, whereas only 0.97% of benign URLs utilize such services. This observation is consistent with the trend that malicious actors increasingly utilize shortening services to camouflage malicious URLs, presenting URLs that do not look suspicious to users [7]. We find that 30.80% of the domains related to malicious URLs are ranked within the top 100K by Alexa [1] (the lower the rank value, the higher the popularity). This indicates the concerning fact that malicious actors are able to reach a large user base, reaping a high return on investment for their attacks.

We further analyze the malicious URLs to identify related malicious URLs. To this end, based on the lexical features in the literature and the hosting features listed in Table 4, we cluster the malicious URLs using the PCA/OPTICS algorithms [13]. While lexical features identify characteristics related to the URLs themselves, hosting features, extracted from Farsight Passive DNS (PDNS) data [5], capture the characteristics of the underlying hosting infrastructure. As shown in Figure 4, these features collectively identify 10 distinct malicious URL clusters. We manually verified the accuracy of the top 2 clusters by checking the web page content, registration information and domain certificate information. Our observations suggest that attackers launch multiple attacks at the same time. We also analyze the clusters based on the maliciousness of URLs. The maliciousness of a URL can loosely be measured by

Table 4: Details of the hosting features for malicious URL clustering.

Feature    Description
VT_Dur     URL duration in VT
PDNS_Dur   Domain duration in PDNS [4]
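The labeling rule and the shortener comparison reported above can be sketched as follows. The report layout used here, a `scans` mapping from engine name to a record with a boolean `detected` field, is an assumption about how a scan report might be represented, and the function names are hypothetical. The shortener list is the top 4 named in the text, and the share has to be computed on URLs as tweeted, before shortened links are expanded.

```python
from urllib.parse import urlsplit

# Top 4 shortening services named in the text.
SHORTENERS = {"bit.ly", "tinyurl.com", "ow.ly", "goo.su"}


def detections(report: dict) -> int:
    """Count the engines in an (assumed) report layout that flag the URL."""
    scans = report.get("scans", {})
    return sum(1 for verdict in scans.values() if verdict.get("detected"))


def is_malicious(report: dict, threshold: int = 1) -> bool:
    """Label a URL malicious when at least `threshold` engines flag it."""
    return detections(report) >= threshold


def shortener_share(urls) -> float:
    """Fraction of URLs whose host is a known shortening service."""
    urls = list(urls)
    if not urls:
        return 0.0
    hits = sum(1 for url in urls
               if (urlsplit(url).hostname or "").lower() in SHORTENERS)
    return hits / len(urls)


if __name__ == "__main__":
    example = {"scans": {"EngineA": {"detected": True},
                         "EngineB": {"detected": False}}}
    print(detections(example), is_malicious(example))
```

Raising `threshold` above 1 trades recall for fewer false positives from a single over-eager engine; the study itself uses the ≥ 1 rule.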
As reported previously [12, 16], there are signs of coordination to spread disinformation content. In this work, one of our assumptions is that the coordination is based on the content being shared (i.e., URLs mentioned in the tweets). One of our objectives is to measure the extent of coordination in URL sharing activities. We construct two bipartite networks to quantify the amount of such coordination effort. The first network connects an author to a low credibility URL mentioned in a tweet, and the second network connects an author to a malicious URL mentioned in a tweet. These networks capture any suspicious behavior of promoting a particular URL.
User-(low credibility) URLs Bipartite Network:
We identify 6,860 low credibility information URLs that are shared by 6,600 users in 26,220 Twitter messages. The five (0.08%) most active users cite low credibility URLs in 6,404 (24%) messages. Their most preferred information source is tr.sputniknews.com, which is the Turkish outlet of the Sputnik media network.
Figure 4: Malicious URL clusters based on the lexical and hosting features. Each point is a URL, and it is colored according to the cluster it belongs to.

User-(malicious) URLs Bipartite Network:
We identify 441 malicious URLs that are shared by 357 users in 571 Twitter messages. While a user might share a malicious URL unwittingly, users who share multiple malicious URLs are suspicious. Specifically, we identify 51 users who share more than 1 malicious URL, and 21 users who share more than 2 malicious URLs.

Figures 5a and 5b show the distributions of bipartite clustering coefficients [9] for URLs in the two bipartite networks, respectively. Clustering coefficient values are higher for URLs that are shared by a group of users who also engage with other URLs together. We compare these values with the clustering values calculated from equivalent random bipartite networks. We construct two random bipartite networks using Newman's configuration model [11] for the comparison. Given the original user-URL bipartite network, we match the degree sequences of both the users and the URLs in the random bipartite network. We noticed a significant deviation of clustering coefficient values for URLs in both bipartite networks compared to URLs in the random bipartite networks (as shown in Figures 5a and 5b). For example, there are 4,320 (63%) low credibility URLs with a clustering coefficient value greater than 0.8, compared with 217 URLs in the random bipartite network. In the malicious URL network, there are 121 URLs with a clustering coefficient value greater than 0.8 that are promoted by the same set of users (the expected number of URLs is 36 in the random network).

To confirm our observation, we also perform the Kolmogorov-Smirnov (KS) test between the clustering coefficient values from the original and random bipartite networks. The KS-statistic values (statistically significant) are 0.6 and 0.27 for the low credibility URL network and the malicious URL network, respectively. This suggests a sustained effort by users to amplify both misleading and malicious content. There may be different types of users (e.g., bots, cyborgs, paid activists) who amplify these URLs.
Identifying these types of users remains future work.

Figure 5: Bipartite clustering coefficient (CDF) of a) the user-(low credibility) URL network, b) the user-(malicious) URL network. (a) Coordination to promote low credibility URLs (KS=0.6, p-value=0.001). (b) Coordination to promote malicious URLs (KS=0.27, p-value=4.45 𝑒 −). We also compare the network clustering values with a random bipartite network using Newman's configuration model [11]. The deviation of the clustering values from the random bipartite network shows a potential coordination effort to promote these URLs on Twitter.
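The comparison against a null model can be sketched with the standard library alone. This is an illustrative reading of the procedure, not the authors' code: it assumes the pairwise-overlap (Jaccard) form of the bipartite clustering coefficient from [9], approximates the configuration model by shuffling the URL endpoints of the edge list, and computes only the two-sample KS statistic, not its p-value. All function names are hypothetical.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def url_clustering(edges):
    """Clustering coefficient per URL from deduplicated (user, url) pairs."""
    users_of = defaultdict(set)  # url -> users who shared it
    urls_of = defaultdict(set)   # user -> urls that user shared
    for user, url in edges:
        users_of[url].add(user)
        urls_of[user].add(url)
    coefficients = {}
    for url, users in users_of.items():
        # Other URLs that share at least one user with this URL.
        neighbours = {v for u in users for v in urls_of[u]} - {url}
        overlaps = [len(users & users_of[v]) / len(users | users_of[v])
                    for v in neighbours]
        coefficients[url] = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return coefficients


def configuration_null(edges, seed=0):
    """Rewire by shuffling URL endpoints; both degree sequences are kept.

    Repeated (user, url) pairs may appear and are merged by url_clustering,
    so degrees are preserved only up to such collisions.
    """
    rng = random.Random(seed)
    users = [user for user, _ in edges]
    urls = [url for _, url in edges]
    rng.shuffle(urls)
    return list(zip(users, urls))


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in points)


if __name__ == "__main__":
    observed = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "c")]
    real = url_clustering(observed)
    null = url_clustering(configuration_null(observed, seed=1))
    print(real, ks_statistic(real.values(), null.values()))
```

A single random rewiring is noisy on small networks; averaging over many seeds would give a steadier estimate of the expected clustering under the null.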
In times of crisis, whether political or health-related, online disinformation is amplified by social media promotion of alternative media outlets [17]. This study adds to the growing body of work [6] that analyzes misinformation activity during the COVID-19 crisis by studying the sharing of URLs on Twitter between August 17 and September 10, 2020. Our contributions complement previous observations [7, 15] in multiple ways. We discover a strong presence of malicious and low credibility information sources shared in Twitter messages in multiple languages. Not only were URLs from low credibility sources, as classified by independent bodies such as NewsGuard and Media Bias/Fact Check, present, but many URLs were shown to point to pages with malicious code. A significant portion of the malicious URLs (25%) were in shortened form (compared to under 1% of the non-malicious URLs) and hosted on well-established, reputable content delivery networks (such as Cloudflare and Akamai) in an attempt, we believe, to avoid detection.

We also discovered potential signs of coordination to promote malicious and low credibility URLs on Twitter. Specifically, we discovered unusual clustering of user activity related to the sharing of such URLs. We use a null model and several statistical tests to compare expected behavior with what we suspect to be coordinated behavior as seen in this dataset.

This study is another data point in the continuous effort [8] to unmask malicious strategies that exploit Twitter and other social media platforms to promote shady objectives in times of crisis.
REFERENCES
ACM TISS
Journal of Computational Social Science 3, 2 (2020), 271–277.
[7] Disinformation Research Group. 2020. Spanish-language vaccine news stories hosting malware disseminated via URL shorteners. https://fas.org/disinfoblog/spanish-language-vaccine-news-stories-hosting-malware-disseminated-via-url-shorteners/
[8] Sameera Horawalavithana, Kin Wai Ng, and Adriana Iamnitchi. 2020. Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, 235–244.
[9] Matthieu Latapy, Clémence Magnien, and Nathalie Del Vecchio. 2008. Basic notions for the analysis of large two-mode networks. Social Networks 30, 1 (2008), 31–48.
[10] Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations. 25–30.
[11] Mark EJ Newman. 2003. The structure and function of complex networks. SIAM Review 45, 2 (2003), 167–256.
[12] Diogo Pacheco, Pik-Mai Hui, Christopher Torres-Lugo, Bao Tran Truong, Alessandro Flammini, and Filippo Menczer. 2020. Uncovering Coordinated Networks on Social Media. arXiv preprint arXiv:2001.05658 (2020).
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[14] Pacific Northwest National Laboratory PNNL. 2018. Socialsim. https://github.com/pnnl/socialsim.
[15] Lisa Singh, Leticia Bode, Ceren Budak, Kornraphop Kawintiranon, Colton Padden, and Emily Vraga. 2020. Understanding high- and low-quality URL Sharing on COVID-19 Twitter streams. Journal of Computational Social Science 3, 2 (2020), 343–366.
[16] Kate Starbird, Ahmer Arif, and Tom Wilson. 2019. Disinformation As Collaborative Work: Surfacing the Participatory Nature of Strategic Information Operations. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 127 (Nov. 2019), 26 pages. https://doi.org/10.1145/3359229
[17] Kate Starbird, Ahmer Arif, Tom Wilson, Katherine Van Koevering, Katya Yefimova, and Daniel Scarnecchia. 2018. Ecosystem or echo-system? Exploring content sharing across alternative media domains. In AAAI ICWSM.
[18] Yla R Tausczik and James W Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of language and social psychology