Look Before You Leap: Detecting Phishing Web Pages by Exploiting Raw URL And HTML Characteristics

Abstract

Cybercriminals resort to phishing as a simple and cost-effective medium to perpetrate cyber-attacks on today's Internet. Recent studies in phishing detection are increasingly adopting automated feature selection over traditional manually engineered features. This transition is due to the inability of existing traditional methods to extrapolate their learning to new data. To this end, in this paper, we propose WebPhish, a deep learning technique using automatic feature selection extracted from the raw URL and HTML of a web page. This approach is the first of its kind, which uses the concatenation of URL and HTML embedding feature vectors as input into a Convolutional Neural Network model to detect phishing attacks on web pages. Extensive experiments on a real-world dataset yielded an accuracy of 98 percent, outperforming other state-of-the-art techniques. Also, WebPhish is a client-side strategy that is completely language-independent and can conduct lightweight phishing detection regardless of the web page's textual language.


Chidimma Opara

Teesside University, Middlesbrough, United Kingdom

Yingke Chen

Teesside University, Middlesbrough, United Kingdom

Bo Wei

Northumbria University, Newcastle, United Kingdom


KEYWORDS

Phishing detection, Web pages, Deep Neural Networks, HTML, URL.

Recently, phishing has become a go-to type of attack for cybercriminals because it is cost-effective and requires little technical knowledge [1]. A phishing attack is launched primarily through spam email. Links embedded within these emails often lead to phishing web pages. In April 2020 alone, Gmail intercepted over 100 million spam emails daily, 18 million of which were COVID-19 pandemic phishing attacks. The scale of this cyber-attack has necessitated the attention of industrial and academic experts.

Several techniques have been proposed to combat phishing attacks on desktop [2–4] and mobile web pages [5]. Most phishing detection schemes are distinguished by the underlying technology used to identify the phishing website. The techniques are mainly classified as search engine based [6], [7], [8], statistical machine learning based [4], phishing blacklist and whitelist based [9], [10], and visual similarity based methods [6], [11]. Research has shown that search engine based techniques are fast and less computationally exhaustive. Despite this speed, they depend on third-party applications and are prone to high false positive rates (FPR).

Consequently, the use of machine learning methods has become popular. This popularity stems from their independence from external systems and their ability to detect zero-day phishing attacks, which reduces FPR. Most existing machine learning-based phishing detection methods, such as the works proposed by [12] and [13], are built using manually engineered features from the prominent components of a web page, such as the URL, HTML, and Network modules. The features fed into the machine learning classification model can be discrete or binary variables. Although these techniques have proven successful, they have some limitations. Manual feature engineering can be tedious and requires specialized domain knowledge to establish the features that will be useful to a specific platform. Also, models built on manual features have difficulty accommodating new data. Therefore, they cannot detect phishing web pages with updated content and structure. This challenge necessitates regular upgrading of the feature set.

Motivated by the above challenges, we propose WebPhish, an end-to-end deep neural network that takes advantage of both the URL and the HTML content in their raw form to detect a phishing attack. We employ an embedding technique to automatically extract features by mapping the corresponding characters into homologous dense vectors. Subsequently, the concatenation layer merges the URL and HTML embedding matrices. Then, a Deep Neural Network (DNN), specifically a Convolutional Neural Network (CNN), is used to model its semantic dependencies.

The main contributions of this work are as follows:

• Different from existing methods, our proposed model, WebPhish, is, to the best of our knowledge, the first to use a concatenation of the raw content of the HTML and URL to determine the maliciousness of a web page using deep neural networks. Automatic feature selection is applied while the Convolutional Neural Networks learn semantic dependencies in the input features' temporal process.
• Using a robust automated feature selection technique, the proposed work reduces the difficulties faced by existing systems based on manually engineered features, such as their lack of flexibility in accommodating new data and their need for specialized domain knowledge.
• Furthermore, as there are a limited number of characters, the use of character-level features in the proposed model enables the embedding feature vectors to generalize to new textual data. This technique ensures that WebPhish can detect zero-day phishing attacks.
• We conduct extensive analysis on a real-world dataset of more than 50,000 URLs and HTML documents collected over two months. The distribution of the instances in the corpus reflects the ratio of phishing and legitimate web pages obtainable on the Internet. This approach ensures that our evaluation metrics and results are extendable to existing real-life systems. Experimental results show that the proposed model significantly outperforms state-of-the-art methods, demonstrating the validity of our approach.

We organized the remainder of the paper as follows: the next section provides an overview of related works on proposed techniques for detecting phishing on web pages. Section 3 provides an in-depth description of our proposed model. Section 4 elaborates on the dataset collection and evaluation metrics used to analyze WebPhish. The detailed results of the evaluations of our proposed model are in Section 5. Finally, we conclude our paper in Section 6.

This section reviews the most common technologies used for phishing detection, specifically list-based methods, statistical machine learning based on manual feature design, and automatic feature extraction using deep neural networks.

The list-based methods reviewed in this section use a whitelist of legitimate websites and a blacklist of unverified websites to detect phishing. The blacklist is populated through user reviews or by third parties who use one of the other phishing detection mechanisms to identify phishing URLs. The machine learning methods described below, in contrast, extract features of malicious and legitimate websites from text, image, or URL-specific content. These features are then combined with a group of algorithms and specified thresholds to determine the maliciousness of the web page.

[14] used natural language processing techniques to analyze the semantic meaning of a given sentence to detect a social engineering attack. Their approach, named SEAHound, analyses a given document for signs of phishing attacks such as urgency in the tone of the message, a malicious link, or a generic greeting. [15] proposed a phishing detection technique that extracts 212 features from the URL, web page content, and registered domain name components of a web page. A Gradient Boosting classifier trained on the extracted features determines a given web page's legitimacy. Likewise, [16] proposed an online phishing detection system, PEDS, composed of a reinforcement learning agent trained on 50 features extracted from the URL, HTML content, and email body and header. The proposed model can mitigate the problem of a limited dataset using an updated offline database. [17] proposed a machine learning-based phishing detection approach that extracts client-side features from the URL and HTML content of a web page. Their approach yielded 99.09 percent accuracy with a random forest classifier on a dataset of 2,141 phishing and legitimate web pages.

The authors of [18] proposed a whitelist-based method that relies on a centralized architecture. The method further inspects the hyperlinks in the HTML source code for null links, empty hyperlinks, and external links to determine the web page's maliciousness. Also, Google provides a Safe Browsing application that allows the browser to verify URLs against a list of suspicious domains, which is regularly updated by Google [19]. Although the list-based methods tend to keep the FPR low, a significant shortcoming is that the lists are not exhaustive and fail to detect zero-day attacks.

The techniques in this section take raw URL or HTML as input and apply the extracted features to a deep neural network to determine a web page's legitimacy.

[20] designed a model that receives the raw URL as input, transforms it into a one-hot encoded vector, and applies LSTM units to determine whether the URL is phishing. The model yielded an accuracy of 98.7 percent on a corpus of 2 million phishing and legitimate URLs. In contrast, [21] transformed the raw URLs into word embeddings, on which convolutional filters were then applied. [22] proposed a model named URLNet, built with a concatenation of convolutional neural networks applied on character and word embedding matrices generated from the input URL. Also, [23] proposed HTMLPhish, a deep neural network model that takes only raw HTML content as input and uses both character and word embedding techniques. The concatenation of these two embeddings represents the features of each HTML document. Subsequently, convolutional layers were applied to model the semantic dependencies. HTMLPhish yielded an accuracy of 94 percent on a dataset of over 50,000 instances.

Despite the similarity between the existing techniques discussed in this section and our proposed model, WebPhish, there are still some significant differences and contributions. Current approaches use either only the URL or only the HTML of a web page as input to the network. WebPhish, however, exploits the benefits of using both URL and HTML in their raw form while maintaining impressive performance and computational costs, even on an imbalanced dataset.

In this section, we elaborate on the architecture of our proposed deep neural network model, WebPhish. Deep learning techniques have been successful in many Natural Language Processing (NLP) tasks, for example, document classification [24] and machine translation [25]. The extensive application of recurrent neural networks (e.g., LSTM [26]) is due to their ability to exhibit temporal behaviour and capture sequential data. However, CNNs are best suited for text classification and sentiment analysis, as a CNN learns to recognize patterns across space [27].

We define the problem of detecting phishing web pages using their URL and HTML content as a binary classification task over two classes: legitimate or phishing. Given a dataset with $R$ web pages $\{(URL_1, HTML_1, y_1), \ldots, (URL_R, HTML_R, y_R)\}$, where $URL_r$ and $HTML_r$ for $r = 1, \ldots, R$ represent the URL and HTML content of the $r$th web page from the dataset, $y_r \in \{0, 1\}$ is its label. $y_r = 1$ corresponds to a phishing HTML content, while $y_r = 0$ is a legitimate HTML content.

As detailed in Figure 2a, WebPhish is a deep neural network comprised of the following layers: 1. Input layer, 2. Embedding layer, 3. CNN layer, 4. Fully Connected (FC) layer, 5. Sigmoid layer. We employ CNN kernels to learn the temporal relations in the input features for the web page classification. We also apply an Embedding layer to extract useful features from the HTML content. At the same time, the FC layers serve as an additional layer to extract other relevant characteristics. Finally, the sigmoid layer outputs the results of the deep neural network model.

Table 3 shows the configuration of the layers of the proposed deep neural network model. The output dimension of the embedding layer, the kernel size and filters in the CNN layer, and the number of units in the FC layers are detailed.


Figure 1: Configuration of the Embedding Layer in WebPhish

One of the main advantages of our proposed model is its capacity to function maximally using unprocessed data. Taking raw URL and HTML content as input, we tokenize the input data and segment the strings into character tokens. An index is then associated with each token from a finite dictionary $M$. We determined $M$ by counting the number of unique characters, including punctuation marks, in the URL and HTML corpus, which gave $M_{URL}$ unique characters for the URL corpus and $M_{HTML}$ unique characters for the HTML corpus.

The index associated with each character through the finite dictionaries $M_{URL}$ and $M_{HTML}$ does not by itself carry much valuable information, so the character embedding matrix subsequently maps each of these indexes to a feature vector. Specifically, for each input, the raw data is processed into character embedding matrices made up of character-level feature representations. The embedding matrices, which are randomly initialised, are gradually modified during training by backpropagation, so that they are structured into a vector space that is relevant to the phishing detection model and can be exploited by the Convolutional layers. Therefore, $URL_{em} \rightarrow s \in \mathbb{R}^{URL_{len} \times d}$ and $HTML_{em} \rightarrow s \in \mathbb{R}^{HTML_{len} \times d}$, where $URL_{len}$ and $HTML_{len}$ are the lengths of the sequences of each URL and HTML instance, respectively, while $d$ is the dimension of the embedding matrix. We selected $URL_{len}$ and $HTML_{len}$ experimentally, with $d = 16$. Figure 1 shows the process in the embedding layer of the proposed model.

We chose character embeddings for our model instead of word embeddings because of some inherent challenges with word embedding techniques. For instance, word embedding techniques cannot extrapolate their learning to unfamiliar words, and the number of unique words depends on the given dataset. The character embedding technique handles these limitations efficiently because there is a finite number of characters and punctuation marks available. This attribute enables the character embedding technique to extract patterns from unfamiliar words.

The Convolutional layers follow the Character Embedding layer. Using the URL matrices (for all $URL_r$, $r = 1, \ldots, R$) and the HTML matrices (for all $HTML_r$, $r = 1, \ldots, R$) as training data, we can now add convolutional layers. We applied Convolutional filters $Conv \in \mathbb{R}^{d \times n}$ where n = 8, Conv = 8 and d = 16. A Max-Pooling layer immediately follows the convolutional layer in our model, and its output is transferred to the FC layers.

The concatenation layer places the features from the previous CNN layers alongside each other to yield a 2-dimensional layer [$URL_r + HTML_r$ → (None, 16) + (None, 16) = (None, 32)].

The FC layers in our model provide it with an added tier for learning more complex representations. The two FC layers analyze the sequences concatenated from the CNN and Max-Pooling layers, applying a ReLU activation in each FC layer.
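The data flow described above (character indexing, embedding lookup, convolution with max pooling, and concatenation of the two branches) can be traced with a minimal pure-Python sketch. It is not the authors' implementation: the weights are random and untrained, the sequence lengths (64 and 128), the number of filters per branch (16, chosen only so that each branch yields 16 values as in the (None, 16) shapes quoted above), the reserved padding index 0, and the sample inputs are all assumptions made for illustration.

```python
import random

def build_vocab(corpus):
    """Map every distinct character to an index; 0 is reserved for padding (assumption)."""
    chars = sorted(set("".join(corpus)))
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, vocab, max_len):
    """Turn a string into a fixed-length list of character indices."""
    ids = [vocab.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))

def random_matrix(rows, cols, rng):
    """Randomly initialised weights, standing in for trainable parameters."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]

def embed(ids, table):
    """Look up one d-dimensional row per character index."""
    return [table[i] for i in ids]

def conv_relu_maxpool(seq, filters):
    """Slide each (width x d) filter over the sequence, apply ReLU, keep the maximum."""
    pooled = []
    for filt in filters:
        width = len(filt)
        best = 0.0  # ReLU output is never below zero
        for start in range(len(seq) - width + 1):
            act = sum(w * x
                      for k in range(width)
                      for w, x in zip(filt[k], seq[start + k]))
            best = max(best, act)
        pooled.append(best)
    return pooled

def branch(text, vocab, max_len, d, n_filters, width, rng):
    """Embedding -> convolution -> global max pooling for one input type."""
    table = random_matrix(len(vocab) + 1, d, rng)
    filters = [random_matrix(width, d, rng) for _ in range(n_filters)]
    return conv_relu_maxpool(embed(encode(text, vocab, max_len), table), filters)

rng = random.Random(0)
url = "paypal.com.example-login.net/verify"            # hypothetical sample
html = "<!DOCTYPE html><html><body></body></html>"     # hypothetical sample
url_vocab, html_vocab = build_vocab([url]), build_vocab([html])

url_feat = branch(url, url_vocab, max_len=64, d=16, n_filters=16, width=8, rng=rng)
html_feat = branch(html, html_vocab, max_len=128, d=16, n_filters=16, width=8, rng=rng)
features = url_feat + html_feat  # concatenation: 16 + 16 = 32 values
print(len(url_feat), len(html_feat), len(features))
```

In a trained model the embedding tables and filters would be learned by backpropagation; the sketch only shows how the shapes combine before the FC layers.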

The Sigmoid layer, which is the final layer in our model, uses the Sigmoid activation to output the result from the model. This last layer, which comes after the FC layers, squashes the output from the model into the range 0 to 1, according to the expression $Q = \frac{1}{1 + e^{-q}}$, giving the probability of the two classes: legitimate or phishing, where $q = (W R_t + b)$. $W$ and $b$ are model parameters, and $R_t$ is the input at time step $t$.

We trained WebPhish using Adam, a method for stochastic gradient optimization posited by [28]. Adam is a blend of two conventional methods of optimization: AdaGrad [29], the adaptive gradient algorithm, and RMSProp [30], which adds a decay term. Adam computes individual adaptive learning rates for the different network parameters from estimates of the first and second moments of the gradients. Research shows that Adam performs equivalently to or better than some other methods of optimization [31], regardless of the hyperparameter environment. Since our deep learning model WebPhish is a binary classification network, we used binary cross-entropy to monitor its performance.
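As a worked illustration of the output and training description above, the following sketch implements the sigmoid, the binary cross-entropy loss, and a single Adam update for one scalar parameter. It assumes the standard Adam formulation with its commonly used defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); the learning rate of 0.0015 is the value the paper reports in its experimental setup, and the function names are hypothetical.

```python
import math

def sigmoid(q):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-q))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def adam_step(param, grad, m, v, t, lr=0.0015, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

probs = [sigmoid(2.0), sigmoid(-1.0)]
loss = binary_cross_entropy([1, 0], probs)
new_param, m, v = adam_step(param=0.5, grad=2.0, m=0.0, v=0.0, t=1)
print(round(loss, 4), new_param)
```

On the first step the bias-corrected ratio of the two moment estimates has magnitude close to one, so the parameter moves by roughly the learning rate regardless of the gradient's scale.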

Alongside our proposed end-to-end framework, which combines character-level embeddings of the URL and HTML content, we also derived three variants, namely: 1. WebPhish-LSTM, 2. WebPhish-URL, and 3. WebPhish-HTML. Their architectures are detailed in Figures 2b, 3a, and 3b. WebPhish-URL and WebPhish-HTML are CNN models trained on only URL and HTML content, respectively. The character embedding matrix in the Embedding layer is applied to the CNN and Max-Pooling layers, whose output is subsequently passed into the FC layers, and the results are output through the Sigmoid layer. On the other hand, WebPhish-LSTM is a deep learning model that uses recurrent neural networks, specifically Long Short-Term Memory (LSTM), to learn the representation of the URL features and HTML content. The character embedding matrix in the Embedding layer is applied to the LSTM layer, whose results are concatenated and passed into the FC layers. Classification results are output through the Sigmoid layer.

Figure 2: Overall Architecture of WebPhish-Full and WebPhish-LSTM: (a) WebPhish-Full, (b) WebPhish-LSTM. The character embedding matrix in the Embedding layer is applied to the Convolutional and the LSTM layer, respectively.

This section elaborates on the experiments conducted to investigate the efficiency of our proposed phishing detection method. Table 1 shows the dataset used for each experiment. Given below is an outline of each experiment:

• Experiment 1 verifies the effectiveness of our proposed method in detecting phishing on the D1 dataset (Section 5.1) and compares its performance with state-of-the-art methods (Section 5.4).
• Experiment 2 is a longitudinal study that demonstrates the temporal resistance of our proposed phishing detection approach (Section 5.2). This experiment illustrates how our model performs when detecting a phishing attack on a freshly collected dataset (D2 dataset).
• Experiment 3 shows the influence of the embedding, CNN, and FC layers on our proposed model's ability to detect phishing on web pages (Section 5.3).
• Experiment 4 (Section 5.6) demonstrates the application of our DNN model on the US airline dataset to classify customer reviews according to their sentiments. This experiment shows that our proposed model can perform optimally on other textual datasets.

We collected real-world datasets from Alexa.com for the legitimate web pages and phishtank.com for the phishing web pages to train our models. Using the Beautiful Soup [32] library in Python, we generated the HTML documents. We chose Beautiful Soup for the following reasons: (1) it has functional versatility and speed in parsing HTML contents, and (2) Beautiful Soup does not correct errors when analyzing the HTML Document Object Model (DOM).

We created a parser to dynamically extract each web page's HTML source code from the final landing page. The phishing URLs were gathered by continuously monitoring phishtank.com from 11 November 2018 to 18 November 2018 for the D1 dataset and from 10 January 2019 to 17 January 2019 for the D2 dataset, while we drew the legitimate URLs from Alexa.com's top domains. Phishtank.com offers a community-based phish verification system where users submit suspected phishes and other users "vote" on whether it is a phish or not. The D1 and D2 datasets were collected about 60 days apart because different phishing pages created around the same time may share the same infrastructure, which could exaggerate the predictive performance of an over-trained classification model.

Also, to ensure our model's deployability to real-world applications, our dataset provided a distribution of phishing to legitimate web pages similar to that obtainable on the Internet [33, 34]. In summary, our corpus contained 47,000 legitimate URL and HTML documents and 4,700 phishing URL and HTML documents, as shown in Table 2.

Note: Our dataset contains web pages written in different languages. Therefore, our model is not limited to detecting only English web pages. Also, we removed URL prefixes such as http:// and https:// to prevent skewed results on different URL datasets and to reduce FPR. We manually sanitized our corpus to ensure we removed replicas and web pages pointing to empty content.

Figure 3: Configuration of WebPhish-URL and WebPhish-HTML, which are CNN models trained on only URL and HTML content, respectively: (a) WebPhish-URL, (b) WebPhish-HTML.

Table 1: Experimental Setting

Experiment    Dataset    Purpose

Table 2: Web Page Documents Used to Evaluate WebPhish

Dataset                 D1                   D2
Date generated          11-18 Nov, 2018      10-17 Jan, 2019
Legitimate Web Pages
Phishing Web Pages
Total
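The URL clean-up described in the note above can be sketched as follows. The paper states that the corpus was sanitized manually, so this automated version is only an approximation under stated assumptions: duplicates are detected by the prefix-stripped URL, a page counts as empty when its HTML is blank, and the function names and sample records are hypothetical.

```python
import re

# Matches a leading http:// or https:// prefix, case-insensitively.
_SCHEME = re.compile(r"^https?://", re.IGNORECASE)

def strip_scheme(url):
    """Remove the http:// or https:// prefix from a URL."""
    return _SCHEME.sub("", url.strip())

def sanitize(pages):
    """Drop replicas and pages with empty content from (url, html) pairs."""
    seen = set()
    cleaned = []
    for url, html in pages:
        key = strip_scheme(url)
        if not html or not html.strip() or key in seen:
            continue  # skip empty pages and URLs already kept
        seen.add(key)
        cleaned.append((key, html))
    return cleaned

pages = [
    ("https://example.com/login", "<html>a</html>"),
    ("http://example.com/login", "<html>a</html>"),   # replica once the prefix is removed
    ("https://example.org/", "   "),                  # empty content
]
print(sanitize(pages))
```

Stripping the prefix before comparing URLs means that the same page served over both schemes is kept only once.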

A suitable combination of hyperparameters was needed to tune WebPhish and its variants. We conducted a grid search to select the best combination of the number of CNN layers (ranging from 1 to 3) and the number of FC layers (from 1 to 3) for our models. Table 10 and Table 9 show the accuracy and training time obtained on WebPhish-Full when the number of convolutional layers and the number of units in the FC layers were varied. Also, using the grid search, we determined the optimization algorithm best suited to the models (varying between RMSProp and Adam) within a range of learning rates (from 0.0001 to 0.1).

Table 3: WebPhish Hyperparameters

Hyperparameters             Potential Choices    Selected
Number of Conv1D layers     1-3                  1
Number of FC layers         1-3                  2
Embedding Dimension

Table 3 details the selected parameters that gave the best performance on our dataset, bearing in mind the unavoidable hardware limitation. We implemented all WebPhish variants in Python 3.5 on a TensorFlow 1.2.1 backend. We set the batch size for training and testing the model to 20. The Adam optimizer [28], with a learning rate of 0.0015, was used to update the network weights. At the same time, we used binary cross-entropy to monitor the performance of the model. The early stopping technique [35] was adopted to prevent overfitting on the training data. We conducted all WebPhish and baseline experiments in a Google Colaboratory environment with 12GB GDDR5 VRAM.

We evaluated the performance of WebPhish using $TPR = \frac{TP}{TP + FN} \times 100$ and $FNR = \frac{FN}{FN + TP} \times 100$, where TP, FP, TN, and FN represent the numbers of True Positives, False Positives, True Negatives, and False Negatives, respectively. TPR measures the percentage of accurately predicted phishing URLs out of the total number of phishing URLs, while FNR measures the percentage of incorrectly predicted phishing URLs out of the total number of phishing URLs.

We also calculated the TNR and FPR metrics using $TNR = \frac{TN}{TN + FP} \times 100$ and $FPR = \frac{FP}{FP + TN} \times 100$. These metrics measure the percentage of correctly and incorrectly classified legitimate URLs out of the total number of legitimate URLs. Finally, the accuracy of WebPhish was determined using Equation 1.

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100$ (1)

We also used the receiver operating characteristic (ROC) curve and the Area Under the Curve (AUC) in our evaluation. The ROC curve is a probability curve, while the AUC depicts how well the model can distinguish between the two classes: legitimate or phishing. The higher the AUC value, the better the performance of the model. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), where $TPR = \frac{TP}{TP + FN}$ and $FPR = \frac{FP}{TN + FP}$.

To document the performance of WebPhish and its variants on our corpus, we split the dataset into 80 percent for training, 10 percent for validation, and 10 percent for testing. Also, taking cognizance of our dataset's imbalanced nature, we ensured we manually shuffled our datasets before training.
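The 80/10/10 split with shuffling can be sketched as below. This is an illustrative helper rather than the authors' code: the function name, the fixed seed, and the integer-percentage arguments are assumptions, and the toy corpus only mimics the roughly ten-to-one ratio of legitimate to phishing pages.

```python
import random

def shuffle_split(samples, train_pct=80, val_pct=10, seed=42):
    """Shuffle the samples, then cut them into train, validation, and test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)       # seeded so the split is reproducible
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]           # remainder goes to the test set
    return train, val, test

# Toy corpus: 1000 legitimate (label 0) and 100 phishing (label 1) records.
corpus = [("legit-%d" % i, 0) for i in range(1000)] + \
         [("phish-%d" % i, 1) for i in range(100)]
train, val, test = shuffle_split(corpus)
print(len(train), len(val), len(test))
```

Shuffling before slicing matters on an imbalanced corpus stored class by class, because an unshuffled cut could leave a partition with no phishing examples at all.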

In Figure 4a and Figure 4b, we show the ROC curves of WebPhish and its variants. As established in Table 4, WebPhish and its variants, state-of-the-art comparative models, and baseline models were trained and tested on the D1 dataset. The WebPhish-Full model outperformed all variants and baselines in every metric measured, with a precision of 99 percent on the D1 dataset. Amongst the WebPhish variants, WebPhish-HTML was the least performing, with an accuracy of 96 percent on the D1 dataset. This outcome is because phishing web pages, especially those hosted on compromised websites, are known to systematically copy the legitimate web page source code in order to blend in effortlessly.

In Table 7, the classification report details how the WebPhish variants performed for each class on the D1 dataset. For the legitimate class, the WebPhish-Full classifier accurately predicted 2394 of the 2410 legitimate instances, with an accuracy of 98 percent and an F-1 score of 98 percent. On the other hand, WebPhish-HTML classified 2359 of the legitimate cases correctly, with a precision of 97 percent and an F-1 score of 96 percent. Further analysis of the classification reports in Table 7 and Table 8 shows that WebPhish-Full, along with the other models, did not perform as well when classifying the phishing instances. For example, in Table 7, 26 out of 230 phishing instances were predicted to be legitimate. We can attribute this result to the imbalanced nature of our corpus. Even though the distribution of the cases in the corpus reflects the ratio of phishing and legitimate web pages obtainable on the Internet, further work needs to be done to improve the classifiers' performance so that they predict a higher percentage of the phishing class correctly. We will address this in our future work.

In general, WebPhish-Full significantly outperforms the other three variants, WebPhish-URL, WebPhish-HTML, and WebPhish-LSTM. WebPhish-Full yielded an average of over 98 percent across its precision, F-1 score, and recall metrics on the D1 dataset. WebPhish-Full takes advantage of the other variants' strengths and produces consistently better results while capturing local and temporal patterns in the data. Furthermore, the precision, recall, and F-1 score obtained for WebPhish are well-balanced, as their values are similar. This result indicates that WebPhish can accurately detect phishing web pages when implemented in the wild.
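The rates defined in the evaluation metrics can be recomputed from the confusion counts quoted above for WebPhish-Full. The sketch below treats phishing as the positive class and reads the counts as TP = 204, FN = 26, TN = 2394, and FP = 16; that reading of Table 7 is an assumption inferred from the statements "2394 of the 2410 legitimate instances" and "26 out of 230 phishing instances", and FNR is computed over the phishing instances, following its prose definition.

```python
def rates(tp, fn, tn, fp):
    """Return TPR, FNR, TNR, FPR, and accuracy as percentages."""
    return {
        "TPR": 100.0 * tp / (tp + fn),        # phishing pages caught
        "FNR": 100.0 * fn / (tp + fn),        # phishing pages missed
        "TNR": 100.0 * tn / (tn + fp),        # legitimate pages passed
        "FPR": 100.0 * fp / (tn + fp),        # legitimate pages flagged
        "Accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
    }

# Counts as read from Table 7 for WebPhish-Full (phishing = positive class).
metrics = rates(tp=204, fn=26, tn=2394, fp=16)
for name, value in metrics.items():
    print("%s: %.2f" % (name, value))
```

Under this reading the overall accuracy is about 98.4 percent, while only about 88.7 percent of the phishing instances are caught, which illustrates the gap on the minority class noted in the text.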

The techniques for implementing a phishing web page are continuously evolving due to emerging technology applications for designing phishing web pages. Evaluating resilience to this evolution is paramount for a phishing web page detection technique. In this paper, we applied the longitudinal study [37] by evaluating the accuracy of WebPhish-Full using freshly collected data.

Table 4: Result of WebPhish and Baseline models on the D1 Dataset

Models                                        Accuracy  Precision  Recall  F-1 Score  Training time
WebPhish-Full                                 0.98      0.99       0.98    0.98       240 seconds
WebPhish-URL                                  0.97      0.97       0.97    0.97       170 seconds
WebPhish-HTML                                 0.96      0.97       0.96    0.96       432 seconds
WebPhish-LSTM                                 0.97      0.97       0.97    0.96       360 seconds
[20]                                          0.97      0.97       0.97    0.97       130 seconds
[21]                                          0.97      0.97       0.97    0.97       300 seconds
[36]                                          0.98      0.98       0.98    0.98       210 seconds
Kernel SVM + Manual Features                  0.93      0.93       0.93    0.91       95 seconds
Logistic Regression + Manual Features         0.95      0.95       0.95    0.95       45 seconds
Random Forest Classifier + Manual Features    0.97      0.97       0.97    0.97       70 seconds

Table 5: Result of WebPhish and Baseline models on the D2 Dataset without retraining

Models                                        Accuracy  Precision  Recall  F-1 Score
WebPhish-Full                                 0.95      0.95       0.94    0.95
WebPhish-URL                                  0.86      0.90       0.86    0.87
WebPhish-HTML                                 0.90      0.84       0.90    0.87
WebPhish-LSTM                                 0.90      0.87       0.90    0.88
[20]                                          0.81      0.79       0.81    0.80
[21]                                          0.83      0.84       0.83    0.84
[23]                                          0.92      0.93       0.93    0.92
Kernel SVM + Manual Features                  0.90      0.91       0.90    0.91
Logistic Regression + Manual Features         0.93      0.92       0.93    0.93
Random Forest Classifier + Manual Features    0.93      0.92       0.92    0.92

Table 6: Result of WebPhish and Baseline models on the D2 Dataset with retraining

Models                                        Accuracy  Precision  Recall  F-1 Score  Training time
WebPhish-Full                                 0.98      0.98       0.98    0.98       150 seconds
WebPhish-URL                                  0.97      0.97       0.97    0.97       110 seconds
WebPhish-HTML                                 0.95      0.95       0.95    0.95       320 seconds
WebPhish-LSTM                                 0.97      0.97       0.96    0.97       160 seconds
[20]                                          0.97      0.97       0.97    0.97       150 seconds
[21]                                          0.82      0.91       0.82    0.85       125 seconds
[36]                                          0.96      0.97       0.97    0.96       100 seconds
Kernel SVM + Manual Features                  0.93      0.94       0.93    0.92       65 seconds
Logistic Regression + TF-IDF + D1 Dataset     0.85      0.84       0.89    0.86       30 seconds
Random Forest Classifier + Manual Features    0.97      0.97       0.97    0.97       50 seconds

Table 7: Confusion Matrix of WebPhish and Baseline Models on the D1 dataset

              WebPhish-Full           WebPhish-HTML           WebPhish-URL
              Legitimate  Phishing    Legitimate  Phishing    Legitimate  Phishing
Legitimate    2394        26          2359        73          2335        29
Phishing      16          204         31          177         55          221

Table 8: Confusion Matrix of WebPhish and Baseline Models on the D1 dataset

              Random Forest Classifier   Kernel SVM              Logistic Regression
              Legitimate  Phishing       Legitimate  Phishing    Legitimate  Phishing
Legitimate    2375        61             2383        171         2372        110
Phishing      11          193            3           83          14          144

Table 9: The Impact of The FC Layers

Models    Accuracy    Training time

Table 10: The Impact of The Convolutional Layers

Models                    Accuracy    Training time
Proposed Model            0.982       240 Seconds
2 Convolutional Layers    0.983       241 Seconds
3 Convolutional Layers    0.984       244 Seconds

This study enabled us to infer a maximum retraining period over which the system’s accuracy does not degrade. For a security supplier deploying WebPhish-Full in the wild, the retraining period gives an approximate cost of maintenance.

Using the evaluation metrics detailed above, we compared the accuracy of the WebPhish variants and the baseline models on the training data D1 with their accuracy when applied to the test data D2 without retraining. As Table 5 shows, the accuracy of the models dropped by a few percentage points, with a minimal drop of 4 percent for the WebPhish-Full model. This drop could be due to phishing content structures in D2 that were not present in the training set. The outcome of our longitudinal study suggests that WebPhish-Full is ready for real-world deployment: it remains temporally robust and does not need retraining for at least two months.

Furthermore, to evaluate the model when it is retrained with unseen data (the D2 dataset), we measured, using the same evaluation metrics, the performance of WebPhish on the D2 dataset when initialized with the parameters learned on the D1 dataset. We removed the sigmoid layer from the transferred parameters and replaced it with a new one, which is trained from scratch using backpropagation on the D2 dataset.

Table 6 shows that the performance of WebPhish and its variants improved by at least 1 percent each, with reduced training and testing time, compared with their performance on the D1 dataset. This improvement indicates that the retrained model remains effective at detecting phishing attacks that imitate new target websites or use new features in their page structure.
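The retraining step described above (keep the parameters learned on D1, discard the old sigmoid layer, and train a new one with backpropagation on D2) can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the WebPhish implementation: `frozen_features` is a hypothetical stand-in for the transferred layers, the data are made up, and the sketch keeps the transferred layers fixed while the new output unit is trained, a detail the text does not specify.

```python
import math


def frozen_features(x):
    """Hypothetical stand-in for the transferred (already trained) layers."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_new_sigmoid_layer(samples, labels, epochs=200, lr=0.5):
    """Train a freshly initialised sigmoid output unit on new data."""
    dim = len(frozen_features(samples[0]))
    weights, bias = [0.0] * dim, 0.0  # new layer starts from scratch
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = frozen_features(x)
            p = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
            err = p - y  # gradient of binary cross-entropy w.r.t. the logit
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    feats = frozen_features(x)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)


# Made-up stand-in for the new dataset: label 1 = phishing, 0 = legitimate.
new_samples = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 0.8)]
new_labels = [0, 0, 1, 1]
weights, bias = train_new_sigmoid_layer(new_samples, new_labels)
print(predict((1.0, 1.0), weights, bias) > 0.5)
```

Only the output unit is updated here, which keeps the example short; fine-tuning the transferred layers as well would follow the same gradient-descent pattern.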

Note:

Given that WebPhish-Full outperformed the other configurations, we use WebPhish-Full as the default setting. For the rest of this section, we will use WebPhish to indicate WebPhish trained with both URL and HTML as input unless otherwise stated.

Many similar deep learning based phishing web page detection models employ common structures. In our proposed model, however, we configured a variable number of FC layers and CNN layers. We therefore examine the effect of the FC layers and CNN layers on the accuracy of the proposed model, WebPhish.

Table 9 shows the impact of the FC layers. Intuitively, we expect that more FC layers will increase the accuracy of the model. Our analysis found that the proposed configuration of 3 FC layers (2 FC layers and 1 sigmoid layer) gave the best performance for our task on the D1 dataset. The proposed model achieved an accuracy of 0.979, 0.98, and 0.982 with 1, 2, and 3 FC layers, respectively.

Table 10 shows the effect of the number of CNN layers in the model. We found that 1 CNN layer gave the best balance between training time (240 seconds) and accuracy (0.982). With 2 and 3 CNN layers, WebPhish achieves an accuracy of 0.983 and 0.984, with training times of 241 seconds and 244 seconds, respectively.

Furthermore, we demonstrate the importance of the embedding layer in our DNN model. We did so by measuring the performance of WebPhish-Full on the D1 dataset when the embedding layer is replaced with manually engineered features, drawn from URL and HTML characteristics, as input to the CNN and fully connected layers. The manual features are listed in Table 12. Although training with the manual features is faster than with the embedding features, Table 11 shows a 4 percent drop across all metrics. This result highlights the importance of the character embedding matrix when analyzing textual content in the URL and HTML. It also shows that the tedious manual feature engineering process can overlook salient characteristics that differentiate a phishing web page from a legitimate one.
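A character embedding layer of the kind discussed above operates on integer character indices rather than on raw text, so the URL and HTML strings first have to be encoded and padded to a fixed length. The following is a minimal sketch of that preprocessing step under stated assumptions: the vocabulary (`string.printable`), the reserved indices, and the sequence lengths 180 and 2000 are hypothetical choices made for illustration and are not taken from the paper.

```python
import string

# Hypothetical character vocabulary. Index 0 is reserved for padding and
# index 1 for any character outside the vocabulary.
VOCAB = {ch: i + 2 for i, ch in enumerate(string.printable)}


def encode_chars(text, max_len):
    """Map text to a fixed-length list of character indices."""
    ids = [VOCAB.get(ch, 1) for ch in text[:max_len]]  # truncate long inputs
    return ids + [0] * (max_len - len(ids))            # pad short inputs


def encode_page(url, html, url_len=180, html_len=2000):
    """Encode the URL and HTML separately; lengths are assumed values."""
    return encode_chars(url, url_len), encode_chars(html, html_len)


url_ids, html_ids = encode_page("http://example.com/login", "<html></html>")
print(len(url_ids), len(html_ids))
```

Because the encoding works on individual characters, it makes no assumption about the natural language of the page, which is consistent with the language-independence claimed for WebPhish.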

We compared the technique and efficiency of WebPhish-Full with the state-of-the-art models in [20], [23], and [21]. [21] is a deep neural network with multiple CNN layers that takes as input word tokens from a URL to determine the maliciousness of the associated web page. In contrast, [20] takes as input the character sequence of a URL and models its sequential dependencies using long short-term memory (LSTM) neural networks to classify the URL as phishing or benign. [23] takes as input both the character sequence and the word sequence of the HTML content of a web page and uses CNN layers to learn its semantic dependencies.

Note. [20] and [21] were applied only to the URLs in our dataset, as the original papers were built only for phishing URL detection, while [23] was applied to the HTML contents.

Tables 4, 5, and 6 show the precision, recall, and F-1 score of WebPhish against the state-of-the-art models for the D1 and D2 datasets. The ROC curves of the state-of-the-art techniques are shown in Figures 5a and 5b. WebPhish outperforms all state-of-the-art techniques (with at least 1 percent improvement in accuracy) in all categories and metrics.

The advantage of using both raw URL and HTML content as input is evident in the performance of WebPhish compared with [20] and [21], which use only the URL component. [21] performs worst among the deep neural networks; a likely reason is that its CNN layers cover only part of the temporal process and cannot capture the long-term sequential dependencies in the text features.
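The ROC curves referred to above are summarised by the area under the curve (the "area" values in the figure legends). As a hedged illustration, the area can be computed without plotting by using its rank interpretation: the probability that a randomly chosen phishing sample receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The function name `roc_auc` and the example scores are made up for this sketch and are not taken from the experiments.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (phishing), 0 for the negative class.
    scores: model outputs, where a higher score means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration; a sort-based computation would be preferable for large test sets.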

Table 11: The Impact of The Embedding Layer

Models                                    Accuracy  Precision  Recall  F-1 Score  Training time
Proposed Model                            0.98      0.99       0.98    0.98       240 Seconds
Proposed Model without Embedding layer    0.96      0.96       0.96    0.9        168 Seconds

Figure 4: ROC Curves of WebPhish and its Variants
(a) ROC Curves on D2 Dataset With Retraining: WebPhish-LSTM (area = 0.98), WebPhish-URL (area = 0.99), WebPhish-HTML (area = 0.93), WebPhish-Full (area = 0.99)
(b) ROC Curves on D2 Dataset Without Retraining: WebPhish-LSTM (area = 0.76), WebPhish-URL (area = 0.83), WebPhish-HTML (area = 0.69), WebPhish-Full (area = 0.84)

Figure 5: ROC Curves of Compared State-of-the-Art Deep Learning Models
(a) ROC Curves on D2 Dataset With Retraining: Opara et al. (area = 0.86), Bahnsen et al. (area = 0.98), Bo et al. (area = 0.51)
(b) ROC Curves on D2 Dataset Without Retraining: Opara et al. (area = 0.88), Bahnsen et al. (area = 0.62), Bo et al. (area = 0.52)

We compared traditional machine learning classifiers, which take a linguistic and statistical analysis of URLs and HTML code as input, with our DNN model, WebPhish. We investigate the effect of deep neural networks in improving phishing web page detection using raw URL and HTML content compared with simpler baseline models trained on manually engineered features. We used three machine learning models: logistic regression, kernel SVM, and a random forest classifier. We chose these models because these traditional classifiers are commonly used in sequence detection systems [38] and are therefore relevant baselines to compare with WebPhish. We used 31 features, detailed in Table 12, culled from [5], [39], [40], and [34].

For the manual features extracted from the URL, research has shown that developers of phishing web pages frequently exploit an Internet user’s familiarity with a website [41] by adding terms to the URL that may trick the user into thinking that the malicious website is the real one. Terms widely used to access genuine websites, such as admin and account, are particularly vulnerable to imitation. The creator of a phishing website would therefore intuitively use such ambiguous terms at the start of the URL, so we regard the presence of those terms in the URL as a feature. Many malicious domain names are the IP addresses of hosting systems [42], [43]. We counted the number of digits in a URL and the percentage of digits in the hostname as features. Also, phishers create several subdomains to include tricky terms, for example, PayPal as a subdomain. This could make phishing URLs longer [43]. Therefore, we included the URL length, whether the URL includes a subdomain, the number of subdomains, and the number of dots as features. Furthermore, the counts of some punctuation marks, such as semicolons, hyphens, and underscores, are included in our URL feature set.

Figure 6: ROC Curves of Machine Learning Models on Manual Features (x-axis: false positive rate; y-axis: true positive rate)
(a) ROC Curves on D2 Dataset With Retraining: Log_Regression (area = 0.78), Random_Forest (area = 0.88), SVM (area = 0.66)
(b) ROC Curves on D2 Dataset Without Retraining: Log_Regression (area = 0.75), Random_Forest (area = 0.80), SVM (area = 0.64)

Table 12: Extracted Manual Features from Both URL And HTML Contents on Web Pages

Features        Description

URL Features
1. Number of misleading words in the URL such as login and bank
2. Number of forward slashes and question marks
3. Number of digits
4. Number of dots
5. Number of hyphens and underscores
6. Number of equal signs and ampersands
7. Number of two-letter subdomains
8. Number of semicolons
9. Number of subdomains
10. Presence of subdomain
11. % of digits in the hostname
12. Length of URL

HTML Features
1. Presence of JavaScript
2. Presence of NoScript
3. Presence of internal JavaScript
4. Presence of external JavaScript
5. Presence of embedded JavaScript
6. Number of JavaScript
7. Number of NoScript
8. Number of internal JavaScript
9. Number of external JavaScript
10. Number of embedded JavaScript
11. Presence of internal links
12. Presence of external links
13. Presence of images
14. Presence of iframes
15. Number of images
16. Number of internal links
17. Number of external links
18. Number of iframes
19. Percentage of white spaces in the HTML content
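To make the manual features of Table 12 concrete, the sketch below computes the twelve URL features and a subset of the HTML features with the Python standard library. It is an illustration under stated assumptions rather than the extraction code used in the experiments: the function names, the `MISLEADING_WORDS` list, and the rule that treats every hostname label left of the last two as a subdomain are all hypothetical simplifications.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative word list; Table 12 names only "login" and "bank" as examples.
MISLEADING_WORDS = ("login", "bank", "admin", "account")


def url_features(url):
    """Compute the URL features of Table 12 for a single URL string."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    # Simplification: everything left of the last two labels is a subdomain,
    # which ignores multi-part suffixes such as "co.uk".
    subdomains = labels[:-2]
    lowered = url.lower()
    host_digits = sum(ch.isdigit() for ch in host)
    return {
        "misleading_words": sum(lowered.count(w) for w in MISLEADING_WORDS),
        "slashes_and_question_marks": url.count("/") + url.count("?"),
        "digits": sum(ch.isdigit() for ch in url),
        "dots": url.count("."),
        "hyphens_and_underscores": url.count("-") + url.count("_"),
        "equals_and_ampersands": url.count("=") + url.count("&"),
        "two_letter_subdomains": sum(len(s) == 2 for s in subdomains),
        "semicolons": url.count(";"),
        "subdomains": len(subdomains),
        "has_subdomain": int(bool(subdomains)),
        "pct_digits_in_hostname": 100.0 * host_digits / len(host) if host else 0.0,
        "url_length": len(url),
    }


class TagCounter(HTMLParser):
    """Counts the start tags needed for a few HTML features of Table 12."""

    TRACKED = ("script", "noscript", "img", "iframe", "a")

    def __init__(self):
        super().__init__()
        self.counts = dict.fromkeys(self.TRACKED, 0)

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1


def html_features(html):
    """Compute a subset of the HTML features of Table 12."""
    counter = TagCounter()
    counter.feed(html)
    counts = counter.counts
    whitespace = sum(ch.isspace() for ch in html)
    return {
        "has_javascript": int(counts["script"] > 0),
        "num_javascript": counts["script"],
        "has_noscript": int(counts["noscript"] > 0),
        "num_noscript": counts["noscript"],
        "has_images": int(counts["img"] > 0),
        "num_images": counts["img"],
        "has_iframes": int(counts["iframe"] > 0),
        "num_iframes": counts["iframe"],
        "num_links": counts["a"],
        "pct_whitespace": 100.0 * whitespace / len(html) if html else 0.0,
    }
```

Distinguishing internal from external links and scripts would additionally require comparing each `href` or `src` value with the page's own hostname, which is omitted here for brevity.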

For the HTML feature set, variables such as the number of white spaces, the presence of internal and external links, and the number and presence of images were extracted because of their relevance in differentiating between a phishing and a legitimate web page.

Once the features are collected, a binary classifier is trained on them. We empirically set the number of trees to 70 for the random forest classifier, the penalty for the logistic regression to L1, and the radial basis function (RBF) kernel parameter of the non-linear SVM to 50.0.

Tables 4, 5, and 6 show the precision, recall, and F-1 score of WebPhish against the traditional machine learning classifiers for the D1 and D2 datasets. The ROC curves of the traditional machine learning classifiers are shown in Figures 6a and 6b. WebPhish outperforms the traditional classifiers (with at least 2 percent improvement in accuracy) in all categories and metrics. The output of the traditional machine learning classifiers shows the limitation of manually engineered features and highlights the importance of the temporal robustness of our proposed method. The random forest classifier yielded better results than the logistic regression and SVM classifiers, so we analyzed which features were most informative to its classification results. The three most important features are the URL’s length, the number of digits in the URL, and the number of misleading words in the URL. This is not surprising, since attackers try to deceive users by employing suspicious words known to the victims. We also observed that phishing URLs tend to have a higher ratio between the length of the path and the length of the hostname.

We also measured the training and evaluation times of WebPhish and the compared models. WebPhish needs 120 seconds to train, while the logistic regression model’s training time was less than a minute.
Nevertheless, once trained, the WebPhish model can conduct phishing detection on one URL and its HTML contents within 194 μs.

To demonstrate our model’s versatility on other textual datasets, we evaluated its performance on the publicly available US airline dataset. We collected the US airline dataset, published by CrowdFlower, from the Kaggle databases. This data includes 14,640 tweets. These tweets belong to six major U.S. airlines: American Airlines, United Airlines, US Airways, Southwest Airlines, Delta Airlines, and Virgin Airlines. Sentiment classification techniques can help researchers and decision-makers in airline companies better understand customers’ feelings, opinions, and satisfaction.

We took as input into WebPhish the actual text tweeted by the customers and the associated airline sentiment confidence. The airline sentiment confidence is a numeric feature representing the confidence level of classifying the tweet into the neutral, positive, or negative class.

Table 13: Result of WebPhish and State-of-the-art models on the US Airline Dataset Classifying 3 classes: Positive, Negative and Neutral

Study       Corpus Size   No. of Classes   Algorithm                   Accuracy in %
[44]        14,640        3 Classes        Decision Tree               63
                                           Random Forest               75.8
                                           SVM                         78.5
                                           Gaussian Naive Bayes        43.8
                                           AdaBoost                    74.6
                                           Logistic Regression         78.7
                                           Gradient Boosting           73.4
                                           Decision Tree Classifier    68.6
                                           Voting Classifier           79.2
[45]        14,640        3 Classes        Gated GRU                   74
WebPhish    14,640        3 Classes        DCNN                        80

Table 14: Result of WebPhish and State-of-the-art models on the US Airline Dataset Classifying 2 classes: Positive and Negative

Study       Corpus Size   No. of Classes   Algorithm    Accuracy in %
[46]        11,542        2 Classes        BiLSTM       93.6
[47]        11,542        2 Classes        Deep RNN     93
WebPhish    11,542        2 Classes        DNN          94.1

Figure 7: Visualisation of feature embedding of sampled URLs and HTML using WebPhish-Full features, WebPhish-URL and WebPhish-HTML. The data points are colour-coded by the sample classes: Legitimate (Blue) and Phishing (Orange). Panels: (a) WebPhish-Full, (b) WebPhish-HTML, (c) WebPhish-URL.

Using the evaluation metrics detailed above in Section 4.3, we applied the WebPhish model to the US airline dataset to evaluate the model’s ability to differentiate between positive and negative emotions in text.

Because the US airline dataset has been publicly available since 2015, previous studies exist on the classification of its sentiments. We compared the performance of WebPhish on the US airline dataset with the following studies. In [44], the authors applied a voting classifier (VC) to classify the tweets in the dataset according to their emotions. The VC is based on logistic regression (LR) and a stochastic gradient descent classifier (SGDC) and uses a soft voting mechanism to make the final prediction. Also, using TF-IDF as a feature extraction mechanism, the authors implemented a phrase-level analysis on seven classification algorithms: Decision Tree, Random Forest, SVM, Gaussian Naive Bayes, AdaBoost, Logistic Regression, Gradient Boosting, Decision Tree Classifier, Voting Classifier. We also compared WebPhish with the system proposed by [45]. Using Doc2Vec embeddings on the US airline dataset, [45] explored the use of a bi-directional gated recurrent unit (GRU) network for sentiment analysis of Twitter data directed at US airlines. They fed into the bi-directional GRU network a word vector trained through a skip-gram model, initialized with the existing GloVe model [48].

We also compared our model with DICET, proposed by [46]. DICET is an automated text pre-processor fed into a bidirectional LSTM with attention for Twitter sentiment analysis. In addition, [47] proposed a sentiment analysis model that extracts relevant features from the US airline dataset using a Hadoop cluster. The obtained features are fed into a deep RNN to perform the classification into two classes: positive and negative reviews. The studies in [46] and [47] removed the neutral class from the US airline dataset during experimentation, thereby reducing the dataset to 11,541 tweets. On the other hand, the works in [44] and [45] experimented with the full corpus of 14,640 instances. Consequently, to ensure fairness, WebPhish was trained and evaluated on both the full and the abridged dataset.

Tables 13 and 14 detail the results of WebPhish and other state-of-the-art techniques on the US airline dataset. WebPhish achieves higher accuracy than the current state-of-the-art methods trained on the US airline Twitter dataset: 80 percent on the three-class task and 94 percent on the two-class task. We therefore conclude that our proposed model is a robust solution that applies to text classification fields beyond social engineering. The core reasons behind the flexibility of WebPhish may include: (i) the concatenation of the embeddings of raw textual content, which ensures its extendibility to unseen data; and (ii) the processing of language nuances by identifying strong connections within the text.

This section visualises the embedding feature information of the URL and HTML instances extracted from the WebPhish model and its variants trained on the D2 dataset. Here we chose the feature sequences after the HTML and URL embedding layers were concatenated (Concatenation layer in Figure 2a), and a 16-dimensional vector was obtained. For the baseline features, we extract the URL features used in WebPhish-URL and the HTML features in WebPhish-HTML. From the extracted feature vectors, we apply t-SNE [49] to reduce the feature dimension and plot the concatenation of the HTML and URLs in a 2-dimensional embedding space for our proposed model and only the URLs for the baseline model.

The concatenated embedded HTML and URLs are shown in Figure 7a, only the embedded URLs in Figure 7c, and only the HTML in Figure 7b. As shown in Figure 7a, for WebPhish-Full, the phishing and legitimate web pages are separated into two groups. The right part of the plot contains most of the legitimate websites, while the phishing websites are located on the left side. Very few phishing instances overlap with legitimate instances. In the WebPhish variants, the separation between phishing and legitimate instances is not as clear. Unlike WebPhish-Full, the WebPhish-URL and WebPhish-HTML embeddings contain several distinct data points spread across the plot, making it difficult to establish clusters.

In this paper, we proposed a web page phishing detection technique. We adopted a DNN, specifically Convolutional Neural Networks, to capture the semantic dependencies in raw URL and HTML content while employing character embedding techniques to enable automatic feature extraction. Evaluation results based on real-world phishing and legitimate web page content demonstrate the effectiveness of our proposed model. As future work, we plan to implement our model as a browser extension, which will enable WebPhish to determine the maliciousness of a website in real time.

ACKNOWLEDGMENTS

The authors hereby acknowledge the Petroleum Technology Development Fund (PTDF), Nigeria for the funding and support provided for this work.

REFERENCES

[1] O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from urls,” Expert Systems with Applications, vol. 117, pp. 345–357, 2019.
[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[3] C. N. Gutierrez, T. Kim, R. Della Corte, J. Avery, D. Goldwasser, M. Cinque, and S. Bagchi, “Learning from the ones that got away: Detecting new forms of phishing attacks,” IEEE Transactions on Dependable and Secure Computing, vol. 15, no. 6, pp. 988–1001, 2018.
[4] E. Buber, B. Dırı, and O. K. Sahingoz, “Detecting phishing attacks from url by using nlp techniques,” in International Conference on Computer Science and Engineering (UBMK), 2017. IEEE, 2017, pp. 337–342.
[5] C. Amrutkar, Y. S. Kim, and P. Traynor, “Detecting mobile malicious webpages in real time,” IEEE Transactions on Mobile Computing, vol. 16, no. 8, pp. 2184–2197, 2017.
[6] K. L. Chiew, E. H. Chang, W. K. Tiong et al., “Utilisation of website logo for phishing detection,” Computers & Security, vol. 54, pp. 16–26, 2015.
[7] Y. Ding, N. Luktarhan, K. Li, and W. Slamu, “A keyword-based combination approach for detecting phishing webpages,” Computers & Security, vol. 84, pp. 256–275, 2019.
[8] G. Varshney, M. Misra, and P. K. Atrey, “A phish detector using lightweight search features,” Computers & Security, vol. 62, pp. 213–228, 2016.
[9] L. Li, E. Berki, M. Helenius, and S. Ovaska, “Towards a contingency approach with whitelist-and blacklist-based anti-phishing applications: what do usability tests indicate?” Behaviour & Information Technology, vol. 33, no. 11, pp. 1136–1147, 2014.
[10] B. Krishnamurthy, O. Spatscheck, J. Van Der Merwe, and A. Ramachandran, “Method and apparatus for identifying phishing websites in network traffic using generated regular expressions,” Nov. 6 2012, US Patent 8,307,431.
[11] T.-C. Chen, T. Stepan, S. Dick, and J. Miller, “An anti-phishing system employing diffused information,” ACM Transactions on Information and System Security (TISSEC), vol. 16, no. 4, pp. 1–31, 2014.
[12] P. A. Barraclough, M. A. Hossain, M. Tahir, G. Sexton, and N. Aslam, “Intelligent phishing detection and protection scheme for online transactions,” Expert Systems with Applications, vol. 40, no. 11, pp. 4697–4706, 2013.
[13] G. Ho, A. Cidon, L. Gavish, M. Schweighauser, V. Paxson, S. Savage, G. M. Voelker, and D. Wagner, “Detecting and characterizing lateral phishing at scale,” in USENIX Security Symposium (USENIX Security 19), 2019, pp. 1273–1290.
[14] T. Peng, I. Harris, and Y. Sawa, “Detecting phishing attacks using natural language processing and machine learning,” in . IEEE, 2018, pp. 300–301.
[15] S. Marchal, K. Saari, N. Singh, and N. Asokan, “Know your phish: Novel techniques for detecting phishing sites and their targets,” in . IEEE, 2016, pp. 323–333.
[16] S. Smadi, N. Aslam, and L. Zhang, “Detection of online phishing email using dynamic evolving neural network based on reinforcement learning,” Decision Support Systems, vol. 107, pp. 88–102, 2018.
[17] A. K. Jain and B. B. Gupta, “Towards detection of phishing websites on client-side using machine learning based approach,” Telecommunication Systems, vol. 68, no. 4, pp. 687–700, 2018.
[18] B. B. Gupta, A. Tewari, A. K. Jain, and D. P. Agrawal, “Fighting against phishing attacks: state of the art and future challenges,” Neural Computing and Applications, vol. 28, no. 12, pp. 3629–3654, 2017.
[19] Google, “Google safe browsing,” http://code.google.com/apis/safebrowsing/, 2019, accessed: 2019-09-30.
[20] A. C. Bahnsen, E. C. Bohorquez, S. Villegas, J. Vargas, and F. A. González, “Classifying phishing urls using recurrent neural networks,” in Electronic Crime Research (eCrime), 2017 APWG Symposium on. IEEE, 2017, pp. 1–8.
[21] B. Wei, R. A. Hamad, L. Yang, X. He, H. Wang, B. Gao, and W. L. Woo, “A deep-learning-driven light-weight phishing detection sensor,” Sensors, vol. 19, no. 19, p. 4258, 2019.
[22] H. Le, Q. Pham, D. Sahoo, and S. C. Hoi, “Urlnet: Learning a url representation with deep learning for malicious url detection,” arXiv preprint arXiv:1802.03162, 2018.
[23] C. Opara, B. Wei, and Y. Chen, “Htmlphish: Enabling phishing web page detection by applying deep learning techniques on html analysis,” in , 2020, pp. 1–8.
[24] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
[25] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[26] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative study of cnn and rnn for natural language processing,” arXiv preprint arXiv:1702.01923, 2017.
[28] D. Kingma and J. Ba, “Adam: a method for stochastic optimization (2014),” arXiv preprint arXiv:1412.6980, vol. 15, 2015.
[29] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[30] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[32] L. Richardson, Beautiful Soup
[33] … NDSS, vol. 10, 2010, p. 2010.
[34] Y. Zhang, J. I. Hong, and L. F. Cranor, “Cantina: a content-based approach to detecting phishing web sites,” in Proceedings of the 16th international conference on World Wide Web, 2007, pp. 639–648.
[35] L. Prechelt, “Early stopping-but when?” in Neural Networks: Tricks of the trade. Springer, 1998, pp. 55–69.
[36] C. Opara, B. Wei, and Y. Chen, “Htmlphish: Enabling accurate phishing web page detection by applying deep learning techniques on html analysis,” arXiv preprint arXiv:1909.01135, 2019.
[37] S. Marchal, G. Armano, T. Gröndahl, K. Saari, N. Singh, and N. Asokan, “Off-the-hook: An efficient and usable client-side phishing prevention application,” IEEE Transactions on Computers, vol. 66, no. 10, pp. 1717–1733, 2017.
[38] M. M. Mirończuk and J. Protasiewicz, “A recent overview of the state-of-the-art elements of text classification,” Expert Systems with Applications, vol. 106, pp. 36–54, 2018.
[39] M. A. Adebowale, K. T. Lwin, E. Sanchez, and M. A. Hossain, “Intelligent web-phishing detection and protection scheme using integrated features of images, frames and text,” Expert Systems with Applications, vol. 115, pp. 300–313, 2019.
[40] R. M. Mohammad, F. Thabtah, and L. McCluskey, “An assessment of features related to phishing websites using an automated technique,” in . IEEE, 2012, pp. 492–497.
[41] R. Dhamija, J. D. Tygar, and M. Hearst, “Why phishing works-proceedings of the sigchi conference on human factors in computing systems,” in CHI, vol. 6, 2006, p. 581.
[42] I. Fette, N. Sadeh, and A. Tomasic, “Learning to detect phishing emails,” in Proceedings of the 16th international conference on World Wide Web, 2007, pp. 649–656.
[43] D. K. McGrath and M. Gupta, “Behind phishing: An examination of phisher modi operandi.” LEET, vol. 8, p. 4, 2008.
[44] F. Rustam, I. Ashraf, A. Mehmood, S. Ullah, and G. S. Choi, “Tweets classification on the base of sentiments for us airline companies,” Entropy, vol. 21, no. 11, p. 1078, 2019.
[45] Y. Tang and J. Liu, “Gated recurrent units for airline sentiment analysis of twitter data,” 2016.
[46] U. Naseem and K. Musial, “Dice: Deep intelligent contextual embedding for twitter sentiment analysis,” in . IEEE, 2019, pp. 953–958.
[47] M. Khan and A. Malviya, “Big data approach for sentiment analysis of twitter data using hadoop framework and deep learning,” in . IEEE, 2020, pp. 1–5.
[48] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642.
[49] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.