Data-Driven Characterization and Detection of COVID-19 Themed Malicious Websites

Mir Mehedi Ahsan Pritom*, Kristin M. Schweitzer†, Raymond M. Bateman†, Min Xu‡ and Shouhuai Xu*
*Department of Computer Science, University of Texas at San Antonio
†U.S. Army Research Laboratory South - Cyber
‡Mastercard
Abstract—COVID-19 has hit the global community hard, and organizations are working diligently to cope with the new norm of "work from home". However, the volume of remote work is unprecedented and creates opportunities for cyber attackers to penetrate home computers. Attackers have been leveraging websites with COVID-19 related names, dubbed COVID-19 themed malicious websites. These websites mostly contain false information, fake forms, fraudulent payments, scams, or malicious payloads to steal sensitive information or infect victims' computers. In this paper, we present a data-driven study on characterizing and detecting COVID-19 themed malicious websites. Our characterization study shows that attackers are agile and are deceptively crafty in designing geolocation-targeted websites, often leveraging popular domain registrars and top-level domains. Our detection study shows that the Random Forest classifier can detect COVID-19 themed malicious websites based on the lexical and WHOIS features defined in this paper, achieving a 98% accuracy and 2.7% false-positive rate.

Index Terms—COVID-19 Cyberattacks, Malicious Websites, Detection, Defense
I. INTRODUCTION
The COVID-19 pandemic has incurred many new cyber attack vectors. Many of these cyber attacks incorporate COVID-19 themed factors into phishing, malware, and scamming schemes for various malicious goals (e.g., monetary benefits, stealing credentials, stealing credit card numbers, or identity theft). For example, there is reportedly a 148% increase in ransomware attacks in March 2020 compared with February 2020 [1], where many attacks are initiated by malicious websites abusing victims' trust.

This paper focuses on one emerging attack vector, namely malicious websites leveraging COVID-19 as a theme, or COVID-19 themed malicious websites [2]. As organizations adopt the "work from home" policy, the consequences of COVID-19 themed malicious websites can be significantly amplified because home computers are often more vulnerable to attack than work computers. During the COVID-19 pandemic, many people lost their jobs and are affected by mental health issues, which causes excessive pressure. These pressures may make average users even more vulnerable to social engineering attacks waged via COVID-19 themed malicious websites. This heightens the importance of understanding and defending against COVID-19 themed malicious websites, which is a new problem that has not been studied before in a systematic way.
Our contributions. In this paper, we make the following contributions. First, we propose a methodology for characterizing and detecting COVID-19 themed malicious websites through a data-driven approach. To the best of our knowledge, this is the first study on data-driven characterization and detection of COVID-19 themed malicious websites. Second, we apply the methodology to specific datasets to draw the following insights: (i) some attackers may be incentivized to use cheaper registrars for registering COVID-19 themed malicious websites; (ii) attackers often abuse popular top-level domains for their COVID-19 themed malicious websites; (iii) attackers are agile in waging the COVID-19 themed malicious website attack; (iv) attackers are crafty in using COVID-19 themed keywords and geographical information in creating COVID-19 themed malicious website domain names; (v) the small degree of data imbalance does not have any significant impact on the effectiveness of detecting COVID-19 themed malicious websites; and (vi) COVID-19 themed malicious website detectors must consider WHOIS features, and Random Forest performs better than K-nearest neighbor, decision tree, logistic regression, and support vector machine.

Paper outline. The rest of the paper is organized as follows. Section II reviews the related work. Section III presents the research questions which guide us to characterize and detect COVID-19 themed malicious websites. Section IV reports the experiments and results. Section V discusses limitations and future research opportunities. Section VI concludes the paper.

II. RELATED WORK
Although the problem of COVID-19 themed malicious websites has not been investigated until now, the problem of malicious websites has been studied in the literature prior to the COVID-19 pandemic. The problem of detecting malicious URLs generated by domain generating algorithms has been investigated in [3]. The problem of detecting phishing websites has been addressed via various approaches, including: the descriptive features-based model [4], the lexical and HTML features-based model [5], the HTML and URL features-based model [6], and the natural language processing and word vector features-based model [7]. The problem of detecting malicious websites has been addressed via the following approaches: leveraging application and network layer information [8], leveraging image recognition [9], leveraging generic URL features [10], [11], leveraging character-level embedding or keyword-based recurrent neural networks [12]–[14], and the notion of adversarial malicious website detection [15]. However, these studies do not consider features pertinent to the COVID-19 pandemic, which we leverage. Nevertheless, the present study falls under the umbrella of cybersecurity data analytics [16]–[20], which in turn belongs to the Cybersecurity Dynamics framework [21]–[25].

III. METHODOLOGY
Our methodology for data-driven characterization and detection of COVID-19 themed malicious websites is centered on answering a range of research questions.
A. Characterization Methodology
In order to characterize COVID-19 themed malicious websites, we address 4 Research Questions (RQs):
• RQ1: Which WHOIS registrars are most abused to launch COVID-19 themed malicious websites?
• RQ2: Which Top Level Domains (TLDs) are most abused by COVID-19 themed malicious websites?
• RQ3: What trends are exhibited by COVID-19 themed malicious websites?
• RQ4: Which theme keywords are mostly abused by attackers, and how?
We consider WHOIS information because it has been shown to be useful in the era prior to the COVID-19 pandemic [8], [15]. Answering the preceding questions will deepen our understanding of COVID-19 themed malicious website attacks.
B. Detection Methodology
We propose leveraging machine learning to detect COVID-19 themed malicious websites and answer:
• RQ5: Which classifier is competent in detecting COVID-19 themed malicious websites?
• RQ6: What is the impact of WHOIS features on the classifier's effectiveness?
In order to answer these questions, we need to train detectors. Figure 1 highlights the methodology for detecting COVID-19 themed malicious websites. The methodology can be decomposed into the following modules: data collection, feature definition and extraction, data pre-processing, classifier training, and classifier testing. Data about websites needs to be collected from reliable sources. The collected data may need enrichment to provide more information, as will be illustrated in our case study. Then, features may be defined to describe these websites. In the case of using deep learning (which requires much larger datasets), features may be automatically learned. One may consider a range of classifiers, which are generically called C_i's in Figure 1. As shown in Figure 1, one can use classifiers individually or as an ensemble (e.g., via a desired voting scheme, such as weighted vs. unweighted majority voting). In the simple form of unweighted majority voting, a website is classified as malicious if a majority of the classifiers predict it as malicious; otherwise, it is classified as benign.
Fig. 1. Methodology for detecting COVID-19 themed malicious websites
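The unweighted majority voting described above can be sketched as follows; the function name and input format are illustrative, not taken from the paper's implementation:

```python
def majority_vote(predictions):
    """Classify a website as malicious (1) if a strict majority of the
    classifiers predict malicious; otherwise classify it as benign (0)."""
    votes = sum(predictions)  # each prediction is 0 (benign) or 1 (malicious)
    return 1 if votes > len(predictions) / 2 else 0
```

For example, with five classifiers voting [1, 1, 1, 0, 0], the website is classified as malicious; a tie under strict majority falls back to benign.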
In order to evaluate the effectiveness of the trained classifiers, we propose adopting the standard metrics, including: accuracy (ACC), false-positive rate (FPR), false-negative rate (FNR), and F-score. Specifically, let TP be the number of true positives, TN be the number of true negatives, FP be the number of false positives, and FN be the number of false negatives. Then, we have

ACC = (TP + TN) / (TP + TN + FP + FN), FPR = FP / (FP + TN), FNR = FN / (FN + TP), and F-score = 2TP / (2TP + FP + FN).

IV. CASE STUDY
Our case study applies the methodology to specific datasets.
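The standard metrics defined in Section III can be computed directly from the confusion-matrix counts; a minimal sketch (the function name is illustrative):

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute ACC, FPR, FNR, and F-score from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn)              # benign sites wrongly flagged as malicious
    fnr = fn / (fn + tp)              # malicious sites missed by the detector
    f_score = 2 * tp / (2 * tp + fp + fn)
    return acc, fpr, fnr, f_score
```

For instance, detection_metrics(8, 9, 1, 2) yields ACC = 0.85, FPR = 0.1, FNR = 0.2, and F-score ≈ 0.842.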
A. Data Collection
Our dataset of COVID-19 malicious website examples is obtained from what was published between 2/1/2020 and 5/15/2020 by two sources: (i) CheckPhish [26], which contains 131,761 malicious websites waging scamming attacks related to COVID-19; and (ii) DomainTools [27], which contains 157,579 malicious websites waging malware, phishing, and spamming attacks related to COVID-19. The union of these two sets leads to a total of 221,921 malicious websites, denoted by D_malicious, owing to the fact that 67,419 websites belong to both sets. For obtaining benign websites, we use the top 250,000 websites from Cisco's Umbrella 1 million websites dataset [28] on 05/16/2020, denoted by D_benign, which is a source of reputable websites. We compile a merged dataset denoted by D_initial = D_malicious ∪ D_benign.

In order to collect WHOIS information of a website, we use the python library whois 0.9.7 to query the WHOIS database on 8/7/2020. We observe that 42,540 (or 19.17%) out of the 221,921 malicious websites have no WHOIS information available, and 93,082 (or 37.2%) out of the 250,000 benign websites have no WHOIS information available. This means that the presence/absence of WHOIS information alone does not indicate whether a website is malicious or not.

B. Characterization Case Study

1) Answering RQ1: Identifying the WHOIS registrars that are most abused to launch COVID-19 themed malicious websites:
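Using the counts reported above, the size of D_malicious follows from inclusion-exclusion over the two feeds; a quick arithmetic check:

```python
# Reported counts from the two feeds and their overlap.
checkphish = 131_761
domaintools = 157_579
overlap = 67_419

# |A ∪ B| = |A| + |B| - |A ∩ B|
d_malicious = checkphish + domaintools - overlap
print(d_malicious)  # 221921, matching the paper's reported total
```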
For this purpose, we use a subset of D_malicious, denoted by D'_malicious, which contains 171,901 malicious websites with WHOIS registrar name information available.

Fig. 2. Top 10 abused WHOIS registrars of COVID-19 themed malicious websites (the y-axis is in log-scale).

Figure 2 depicts the top 10 abused registrars, which are ranked according to the absolute number of COVID-19 themed websites in D'_malicious that are respectively registered by them. We observe that Godaddy is the most frequently abused registrar, followed by Google and Namecheap. This finding inspires us to analyze whether there is any financial incentive behind the use of a specific registrar. The cost of registering a .com domain in the first year is: Godaddy for $11.99, Google for $9, Namecheap for $8.88, Dynadot for $8.99, one registrar for $1, name.com for $8.99, PDR Ltd for $35, OVH for $8.28, Alibaba for $7.99, and Reg-ru for $28. This suggests that some attackers might have considered the $1 registrar because it is the cheapest, while other attackers use reputed registrars.

Insight: Some attackers may be incentivized to use cheaper registrars, but others are not.
2) Answering RQ2: Which Top Level Domains (TLDs) are most abused by COVID-19 themed malicious websites?:

In order to answer this question, we use the original dataset D_malicious, which contains 221,921 COVID-19 themed malicious websites with corresponding TLD information.

Fig. 3. Top 10 abused TLDs of COVID-19 themed malicious websites (the y-axis is in log-scale).

Figure 3 depicts the top 10 abused TLDs, which are ranked according to the absolute number of COVID-19 themed malicious websites under each TLD. We make the following observations. First, .com hosts the highest number of malicious websites, followed by .org and .net. Second, 5 of the top 10 abused TLDs are country-code TLDs (ccTLDs), including .de, .uk, .ru, .nl and .eu.

Insight: Attackers often abuse popular TLDs.
3) Answering RQ3: What trends are exhibited by COVID-19 themed malicious websites?:

In order to answer this question, we use the dataset D_malicious mentioned above. Figure 4 depicts the trend of malicious websites, leading to two observations. First, there is a discrepancy between the daily numbers of websites that are reported by the two sources. According to CheckPhish, the number of COVID-19 themed malicious websites reaches its peak on 03/25/2020, with 18,495 malicious websites; according to DomainTools, the number of COVID-19 themed malicious websites reaches a peak on 03/20/2020, with 3,981 malicious websites. This data indicates that there are reporting inconsistencies among sources and that many COVID-19 themed malicious websites were created at the early stage of the pandemic, when uncertainty was at its highest. Second, the number of COVID-19 themed malicious websites, by and large, has been decreasing since the last week of March 2020 (i.e., two weeks after the pandemic declaration), leading to about 1,000 websites per day during the first week of May 2020 (i.e., about two months after the pandemic declaration). However, there is still oscillation. One possible cause is that the attackers have been waiting to create new COVID-19 themed malicious websites based on the pandemic's new developments (e.g., vaccine).

Fig. 4. Trends of COVID-19 themed malicious websites.

Insight: Despite inconsistencies in reporting mechanisms, attackers are agile in creating COVID-19 themed malicious websites.
4) Answering RQ4: Which theme keywords are mostly abused by attackers, and how?:

In order to answer this question, we analyze the dataset D_malicious mentioned above. We use the python library wordninja with the English Wikipedia language model [29] to split domain name strings and extract COVID-19 themed keywords. We observe that 4 keywords (i.e., covid, corona, covid19, and coronavirus) are most widely used, as expected; they are followed by mask, quarantine, virus, test, face mask, pandemic, and vaccine. We extract more than 19,000 keywords. A further analysis of the domain names reveals that attackers create COVID-19 themed malicious websites with names containing geographical attributes. For example, coronaviruspreventionsanantonio.com, coronavirusprecentionhouston.com, and coronaviruspreventiondallas.com use a combination of a city name and a COVID-19 themed keyword. Moreover, we observe the existence of COVID-19 themed "parking" websites, which have no content at the present time but might be used for upcoming COVID-19 themes.

Insight: Attackers are crafty in using COVID-19 themed keywords and geographical information in creating COVID-19 themed malicious website domain names.
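The paper uses wordninja for word splitting; as a simplified, stdlib-only illustration of spotting theme keywords in a domain name (the keyword list and function are illustrative, not the paper's code):

```python
# Illustrative subset of the theme keywords reported in the paper.
THEME_KEYWORDS = ("covid19", "coronavirus", "covid", "corona",
                  "mask", "quarantine", "virus", "test", "pandemic", "vaccine")

def themed_keywords(domain):
    """Return the COVID-19 theme keywords appearing in a domain name."""
    name = domain.lower().rsplit(".", 1)[0]  # drop the TLD
    return [kw for kw in THEME_KEYWORDS if kw in name]

print(themed_keywords("coronaviruspreventiondallas.com"))
# ['coronavirus', 'corona', 'virus']
```

Note that plain substring matching over-counts nested keywords (coronavirus also contains corona and virus), which is one reason the paper uses a word-splitting library instead.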
C. Detection Case Study
Given D_initial, the detection case study proceeds as follows.
1) Feature Definition and Extraction:
We define features according to the following aspects of websites: WHOIS (F1-F4), domain name lexical information (F5-F9), statistical information (F10), and Top-Level Domain or TLD (F11).
• Current WHOIS registration lifetime (F1): This is the number of days that have passed since a website's registration, with respect to the date when this feature's value is extracted (e.g., 08/07/2020 in our case).
• Remaining WHOIS expiration lifetime (F2): This is the number of remaining days before a website's WHOIS registration expires, with respect to the date when this feature's value is extracted (e.g., 08/07/2020 in our case).
• Number of days since last WHOIS update (F3): This is the number of days elapsed since a website's last update, with respect to the date when this feature's value is extracted (e.g., 08/07/2020 in our case).
• WHOIS registrar reputation (F4): We propose measuring a WHOIS registrar's reputation as n / |D_benign|, where n is the number of benign websites in D_benign that are registered by this particular registrar and |D_benign| is the size of the set D_benign.
• Number of dots in domain name (F5): This is the number of dots (character '.') in the domain name. For example, the domain any.com has 1 dot.
• Domain hyphen count (F6): This is the number of hyphens ('-') in a domain name.
• Domain vowel count (F7): This is the number of vowels (i.e., a, e, i, o, u) in a domain name.
• Domain digits percentage (F8): This is the ratio of the number of digits (0-9) in a domain name to the number of characters, including digits.
• Domain unique alphabetic-numeric characters count (F9): This is the total number of unique alphabetic and numeric characters (i.e., a-z, A-Z, 0-9) in a domain name.
• Domain entropy (F10): This is the Shannon entropy [30] of the domain name (i.e., a kind of statistical information), which is computed based on the frequency of characters in the domain name.
• TLD reputation (F11): We propose measuring a TLD's reputation as m / |D_benign|, where m is the number of websites in D_benign that contain this particular TLD.
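A stdlib-only sketch of a few of the lexical and statistical features above (F5-F7 and F10); the function name and returned dictionary are illustrative:

```python
import math
from collections import Counter

def lexical_features(domain):
    """Extract a few of the lexical/statistical features described above."""
    dots = domain.count(".")                        # F5: dot count
    hyphens = domain.count("-")                     # F6: hyphen count
    vowels = sum(domain.count(v) for v in "aeiou")  # F7: vowel count
    # F10: Shannon entropy over character frequencies of the domain name
    counts = Counter(domain)
    n = len(domain)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"dots": dots, "hyphens": hyphens, "vowels": vowels, "entropy": entropy}
```

For example, a domain consisting of a single repeated character has entropy 0, while algorithmically generated names with near-uniform character usage score much higher.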
2) Data Pre-Processing:

Given that some websites may not have information for all the features, it is important to consider different scenarios. In our example, we propose considering two datasets that can be derived from D_initial, because some websites do not have information for the WHOIS features.
• Dataset D1 ⊂ D_initial consists of websites for which WHOIS information is available (i.e., features F1-F4 are available). D1 contains 21,749 websites in total, including 16,411 COVID-19 themed malicious websites and 5,338 benign websites.
• Dataset D2 ⊂ D_initial, where D1 ∩ D2 = ∅, consists of websites for which WHOIS information is absent (i.e., features F1-F4 are entirely missing). D2 contains 135,621 websites, including 42,540 malicious websites and 93,081 benign websites. For each website belonging to D2, only the values of the 7 features F5-F11 are available.

TABLE I. RELATIVE IMPORTANCE OF FEATURES IN D1 WITH RESPECT TO THE RANDOM FOREST METHOD.

Feature  Importance    Feature  Importance
F1       0.429         F7       0.080
F2       0.094         F8       0.009
F3       0.131         F9       0.028
F4       0.065         F10      0.029
F5       0.065         F11      0.068
F6       0.003
Since only D1 contains all WHOIS information, we use it for the feature selection study. For this purpose, we use the random forest classification feature importance method [31] (with the 80-20 splitting of training-test data) to find the important features. Table I depicts the relative importance of the features in D1. We observe that F6 and F8 have a very small relative importance (i.e., < 0.01) when compared to the others, suggesting that hyphens and digits are equally used in malicious and benign domain names. Hence, we eliminate F6 and F8 in the rest of the study of D1.

In order to see whether or not the feature selection result is impacted by the data imbalance of D1 (with the malicious:benign ratio being 3.1:1), we explore two widely-used methods: (i) oversampling the minority class to replicate some random examples; and (ii) undersampling the majority class to remove some random examples. At first, we do the 80-20 splitting of training-test data, and then change the malicious:benign ratio in the training set, while keeping the test set intact. We wish to identify the ratio that achieves the highest F-score. In what follows we only report the results of Random Forest because it outperforms the other classifiers on the original dataset D1.

Table II shows the impacts of the malicious:benign ratio in the training set. We observe that the oversampling-incurred ratio 1.67:1 leads to the highest F-score (and the second best FPR and the lowest FNR), while undersampling never performs better than the original data ratio in terms of accuracy and F-score. This can be explained by the fact that the latter eliminates useful information.
This prompts us to use oversampling to achieve the 1.67:1 ratio when training classifiers, which turns D1 into D1' (i.e., the training set is augmented). Figure 5 further highlights the confusion matrix of the experiment on the same test set but corresponding to D1 and D1', which shows a slight improvement in detection when augmenting the training set with oversampling.

TABLE II. IMPACT OF THE MALICIOUS:BENIGN RATIO ON THE EFFECTIVENESS OF THE RANDOM FOREST CLASSIFIER WITH OVERSAMPLING AND UNDERSAMPLING, WHERE D1 WITH RATIO 3.1:1 IS THE ORIGINAL D1.

Dataset  Method       Ratio    ACC    FPR    FNR    F-score
D1       (none)       3.1:1    0.980  0.030  0.017  0.987
D1       Oversample   2:1      0.980  0.030  0.018  0.986
D1       Oversample   1.67:1   0.980  0.027  0.017  0.988
D1       Oversample   1.43:1   0.979  0.028  0.019  0.986
D1       Oversample   1.25:1   0.979  0.028  0.018  0.986
D1       Oversample   1.11:1   0.979  0.027  0.019  0.986
D1       Oversample   1:1      0.979  0.026  0.019  0.986
D1       Undersample  2:1      0.977  0.023  0.022  0.985
D1       Undersample  1.67:1   0.976  0.023  0.025  0.984
D1       Undersample  1.43:1   0.975  0.023  0.025  0.984
D1       Undersample  1.25:1   0.972  0.020  0.031  0.981

Fig. 5. Confusion matrix for (a) D1 with 3.1:1 malicious:benign ratio in the training data and (b) D1' with 1.67:1 ratio in the training data.

Insight: The data imbalance issue does not affect the model performance significantly in this case, perhaps because the degree of imbalance is not severe enough.
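The random oversampling step described above can be sketched in pure Python as follows; the function is illustrative of the idea, not the paper's exact code:

```python
import random

def oversample_minority(majority, minority, target_ratio, seed=0):
    """Replicate random minority-class examples until the
    majority:minority ratio drops to target_ratio."""
    rng = random.Random(seed)
    minority = list(minority)  # copy; the original list is left intact
    while len(majority) / len(minority) > target_ratio:
        minority.append(rng.choice(minority))  # duplicate a random example
    return minority
```

Applied to a training set with a 3.1:1 malicious:benign split and target_ratio=1.67, this replicates benign examples until the ratio reaches 1.67:1, while the test set is left untouched.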
3) Training and Test:
Having addressed the issues of feature selection and data imbalance, we consider the following classifiers: Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM). Specifically, we use the python sklearn module to import the following classifier algorithms: (i) Random Forest or RF with parameters n_estimators=100 (i.e., 100 trees in a forest) and criterion='entropy' (i.e., entropy is used to measure information gain); (ii) K-Nearest Neighbor or KNN, with parameters n_neighbors=8 (i.e., 8 neighbors are considered), metric='minkowski' with p=2 (i.e., the Minkowski metric with p=2 measures the distance between two feature vectors), and the remaining parameters set to their default values; (iii) Decision Tree or DT with default parameters; (iv) Logistic Regression or LR with default parameters; (v) Support Vector Machine or SVM with a linear kernel and other default parameters. For voting the outputs of the five classifiers mentioned above, we use the VotingClassifier() function and set voting='hard' (i.e., majority voting). We always consider the 80-20 splitting of the scaled training-test data.
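The classifier setup described above can be sketched with sklearn; the feature matrix here is synthetic stand-in data (the parameters follow the paper's description, everything else is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the website feature matrix
# (9 features, matching F1-F5, F7, F9-F11 after dropping F6 and F8).
X, y = make_classification(n_samples=400, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clfs = [
    ("rf", RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=8, metric="minkowski", p=2)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(kernel="linear")),
]
ensemble = VotingClassifier(estimators=clfs, voting="hard")  # majority voting
ensemble.fit(X_tr, y_tr)
print(ensemble.score(X_te, y_te))  # accuracy on the held-out 20%
```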
4) Answering RQ5 and RQ6:
In order to answer RQ5 and RQ6, we conduct the following experiments, where we use the 80-20 train-test splitting of D1.
• Experiment (Exp.) 1: Use the lexical, statistical, and TLD features (i.e., F5, F7, F9-F11) only, while ignoring the WHOIS features. (This experiment is equally applicable to D2, which is not reported owing to space limitations.)
• Experiment (Exp.) 2: Use the WHOIS features (i.e., F1-F4), while ignoring all other features.
• Experiment (Exp.) 3: Use both lexical and WHOIS features (i.e., F1-F5, F7, F9-F11).
TABLE III. EXPERIMENTAL RESULTS ON DATASET D1' WITH A RANGE OF CLASSIFIERS (WITH OVERSAMPLING) AND THEIR TOTAL CPU TIMES FOR TRAINING AND TEST: EXP. 1 USES LEXICAL FEATURES ONLY; EXP. 2 USES WHOIS FEATURES ONLY; EXP. 3 USES BOTH LEXICAL AND WHOIS FEATURES.

Exp.  Classifier  ACC    FPR    FNR    F-score  Execution Time (s)
1     RF          0.924  0.150  0.052  0.950    0.48
2     RF          0.977  0.025  0.023  0.985    0.59
3     RF          0.980  0.027  0.017  0.988    0.64
1     KNN         0.887  0.199  0.086  0.925    0.40
2     KNN         0.949  0.034  0.056  0.966    0.25
3     KNN         0.947  0.031  0.060  0.964    0.30
1     DT          0.917  0.151  0.061  0.945    0.07
2     DT          0.973  0.045  0.022  0.982    0.08
3     DT          0.974  0.051  0.019  0.983    0.14
1     LR          0.885  0.216  0.082  0.924    20.30
2     LR          0.883  0.362  0.038  0.926    23.03
3     LR          0.918  0.178  0.051  0.946    44.40
1     SVM         0.888  0.220  0.078  0.925    1.69
2     SVM         0.881  0.373  0.038  0.924    1.68
3     SVM         0.920  0.164  0.054  0.946    2.38
1     Ensemble    0.916  0.171  0.056  0.945    21.40
2     Ensemble    0.962  0.031  0.041  0.974    24.75
3     Ensemble    0.970  0.035  0.028  0.980    45.70

Table III summarizes the experimental results with a range of classifiers and the actual time spent on training a model and classifying the entire test set. We make several observations. First, for a specific classifier, using WHOIS features alone (Exp. 2) almost always leads to significantly higher effectiveness than using lexical features alone (Exp. 1), except for Logistic Regression. Second, for a fixed classifier, using both lexical and WHOIS features together (i.e., Exp. 3) always performs better than using lexical or WHOIS features alone. Third, among the classifiers considered, Random Forest performs the best in every metric in each experiment. In particular, Random Forest (i.e., a non-linear classifier) achieves a better performance than the Ensemble method because some classifiers (e.g., Logistic Regression and SVM) are substantially less accurate than the others and therefore "hurt" the voting results. Fourth, Decision Tree has the fastest execution time, followed by KNN and Random Forest, while Logistic Regression is the slowest and causes a delay for the voting ensemble. To understand the generalizability, when conducting Exp. 1 on the augmented D2' with the benign:malicious ratio at 1.25:1, we observe that Random Forest outperforms the other models by achieving a 0.947 accuracy, a 0.066 FPR, a 0.041 FNR, and a 0.947 F1-score.

Insight: COVID-19 themed malicious website detectors must consider WHOIS features, and Random Forest performs the best among the classifiers that are considered.

V. DISCUSSION
The present study has several limitations, which should be addressed in future studies. First, we use a heuristic method to determine the ground truth. This heuristic method can only approximate the ground truth because the data sources (i.e., the CheckPhish and DomainTools feeds in this case) may contain some errors. Second, we could not avoid the data imbalance problem, meaning that the resulting detectors or classifiers may be slightly biased towards the majority class even after the oversampling. Third, we only considered the WHOIS and URL lexical features, but not the website contents or the network layer features. Fourth, we only considered five WHOIS features because most of the other kinds of WHOIS information are largely missing, which means that WHOIS registrars need to collect more detailed information than what is present at the moment of writing. Fifth, the application of deep learning models or explainable ML is left to future research. Sixth, we observe that the python library wordninja can make bad splits at times (e.g., when a domain name is seemingly in English characters but actually in another language).

VI. CONCLUSION
We have presented the first systematic study on data-driven characterization and detection of COVID-19 themed malicious websites. We presented a methodology and applied it to a specific dataset. Our experiments led to several insights, highlighting that attackers are agile, crafty, and economically incentivized in waging COVID-19 themed malicious website attacks. Our experiments show that Random Forest can serve as an effective detector against these attacks, especially when WHOIS information about the websites in question is available. This highlights the importance for domain registrars to collect more information when registering domains in the future.

Acknowledgement. We thank the reviewers for their useful comments. This work was supported in part by ARO Grant
REFERENCES

[3] Proc. IEEE ICEI, 2019, pp. 487–492.
[4] O. Christou, N. Pitropakis, P. Papadopoulos, S. McKeown, and W. J. Buchanan, "Phishing URL detection through top-level domain analysis: A descriptive approach," in ICISSP, 2020.
[5] M. Chatterjee and A. Namin, "Detecting phishing websites through deep reinforcement learning," in Proc. IEEE COMPSAC, 2019, pp. 227–232.
[6] Y. Li, Z. Yang, X. Chen, H. Yuan, and W. Liu, "A stacking model using URL and HTML features for phishing webpage detection," Future Gener. Comput. Syst., vol. 94, pp. 27–39, 2019.
[7] O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, "Machine learning based phishing detection from URLs," Expert Systems with Applications, vol. 117, pp. 345–357, 2019.
[8] L. Xu, Z. Zhan, S. Xu, and K. Ye, "Cross-layer detection of malicious websites," in Third ACM Conference on Data and Application Security and Privacy (CODASPY'13), 2013, pp. 141–152.
[9] D. Liu, J. Lee, W. Wang, and Y. Wang, "Malicious websites detection via CNN based screenshot recognition," in International Conf. on Intelligent Computing and its Emerging Applications, 2019, pp. 115–119.
[10] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Learning to detect malicious URLs," ACM TIST, vol. 2, no. 3, pp. 30:1–30:24, 2011.
[11] H. M. Junaid Khan, Q. Niyaz, V. K. Devabhaktuni, S. Guo, and U. Shaikh, "Identifying generic features for malicious URL detection system," in Proc. IEEE UEMCON, 2019, pp. 0347–0352.
[12] R. Verma and A. Das, "What's in a URL: Fast feature extraction and malicious URL detection," in Proc. ACM IWSPA'17, 2017, pp. 55–63.
[13] F. D. Abdi and L. Wenjuan, "Malicious URL detection using convolutional neural network," Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1155304
[14] W. Yang, W. Zuo, and B. Cui, "Detecting malicious URLs via a keyword-based convolutional gated-recurrent-unit neural network," IEEE Access, vol. 7, pp. 29891–29900, 2019.
[15] L. Xu, Z. Zhan, S. Xu, and K. Ye, "An evasion and counter-evasion study in malicious websites detection," in Proc. IEEE CNS, 2014, pp. 265–273.
[16] J. Mireles, E. Ficke, J. Cho, P. Hurley, and S. Xu, "Metrics towards measuring cyber agility," IEEE T-IFS, vol. 14, no. 12, pp. 3217–3232, 2019.
[17] Z. Zhan, M. Xu, and S. Xu, "Characterizing honeypot-captured cyber attacks: Statistical framework and case study," IEEE T-IFS, vol. 8, no. 11, 2013.
[18] ——, "Predicting cyber attack rates with extreme values," IEEE T-IFS, vol. 10, no. 8, pp. 1666–1677, 2015.
[19] Y. Chen, Z. Huang, S. Xu, and Y. Lai, "Spatiotemporal patterns and predictability of cyberattacks," PLoS One, vol. 10, no. 5, p. e0124472, 2015.
[20] M. Xu, K. M. Schweitzer, R. M. Bateman, and S. Xu, "Modeling and predicting cyber hacking breaches," IEEE T-IFS, vol. 13, no. 11, pp. 2856–2871, 2018.
[21] S. Xu, "Cybersecurity dynamics: A foundation for the science of cybersecurity," in Proactive and Dynamic Network Defense, 2019, pp. 1–31.
[22] R. Zheng, W. Lu, and S. Xu, "Preventive and reactive cyber defense dynamics is globally stable," IEEE TNSE, vol. 5, no. 2, pp. 156–170, 2018.
[23] H. Chen, J. Cho, and S. Xu, "Quantifying the security effectiveness of firewalls and DMZs," in Proc. HoTSoS'2018, 2018, pp. 9:1–9:11.
[24] M. Pendleton, R. Garcia-Lebron, J. Cho, and S. Xu, "A survey on systems security metrics," ACM Comput. Surv., vol. 49, no. 4, pp. 62:1–62:35, 2016.
[25] H. Chen, J. Cho, and S. Xu, "Quantifying the security effectiveness of network diversity," in Proc. HoTSoS'2018.
et al., Applied Predictive Modeling. Springer, 2013, vol. 26.