HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis
Chidimma Opara
Teesside University, Middlesbrough TS1 3BX, [email protected]
Bo Wei
Northumbria University, Newcastle upon Tyne, [email protected]
Yingke Chen
Teesside University, Middlesbrough, [email protected]
ABSTRACT
Recently, the development and implementation of phishing attacks require little technical skill and cost. This uprising has led to an ever-growing number of phishing attacks on the World Wide Web. Consequently, proactive techniques to fight phishing attacks have become extremely necessary. In this paper, we propose HTMLPhish, a deep learning based, data-driven, end-to-end automatic phishing web page classification approach. Specifically, HTMLPhish receives the content of the HTML document of a web page and employs Convolutional Neural Networks (CNNs) to learn the semantic dependencies in the textual contents of the HTML. The CNNs learn appropriate feature representations from the HTML document embeddings without extensive manual feature engineering. Furthermore, our proposed concatenation of the word and character embeddings allows our model to manage new features and ensures easy extrapolation to test data. We conduct comprehensive experiments on a dataset of more than 50,000 HTML documents that provides a distribution of phishing to benign web pages obtainable in the real world, yielding over 93% Accuracy and True Positive Rate. Also, HTMLPhish is a completely language-independent and client-side strategy which can, therefore, conduct web page phishing detection regardless of the textual language.
KEYWORDS
Phishing detection, Web pages, Classification model, Convolutional Neural Networks, HTML
INTRODUCTION
The infamous phishing attack is a social engineering technique that manipulates internet users into revealing private information that may be exploited for fraudulent purposes [1]. This form of cybercrime has recently become common because it is carried out with little technical ability and insignificant cost [2]. The proliferation of phishing attacks is evident in the 46% increase in the number of phishing websites identified between October 2018 and March 2019 by the Anti-Phishing Working Group (APWG) [3]. Most phishing attacks are started by an unsuspecting Internet user merely clicking on a link in a phishing email message that leads to a bogus website. The impact of phishing attacks on individuals, such as identity theft and psychological and financial costs, can be devastating.
Problem Definition
Recent research in phishing detection approaches has resulted in the rise of multiple technical methods such as augmenting password logins [4] and multi-factor authentication [5]. However, these techniques are usually server-side systems that require the Internet user to correspond with a remote service, which adds further delay in the communication channel. Another popular phishing detection system that relies on a centralised architecture is the phishing blacklist and whitelist method [6]. A URL visited by an internet user is compared with the URLs in these lists in real time. Although the list-based methods tend to keep the false positive rate low, a significant shortcoming is that the lists are not exhaustive, and they fail to detect zero-day phishing attacks. To mitigate these limitations, researchers have developed several anti-phishing techniques using machine learning models, as they are mostly client-side based and can generalise their predictions to unseen data.

Machine learning-based anti-phishing techniques typically follow specific approaches: (1) the required representation of features is firstly extracted, then (2) a phishing detection machine learning model is trained using the feature vectors. To extract the feature representation from the lexical and static components of a web page, the machine learning models rely on the assumption that the infrastructure of phishing pages is different from that of legitimate pages. For example, in [7], phishing web pages are automatically detected based on handcrafted features extracted from the URL, HTML content, network, and JavaScript of a web page. Furthermore, natural language processing techniques are currently used to extract specific features such as the number of common phishing words, type of n-gram, etc. from the components of a web page [8–10].
While the above approaches have proven successful, they nevertheless are prone to several limitations, particularly in the context of HTML analysis: (i) inability to accommodate unseen features: as the accuracy of existing models depends on how comprehensive the feature set is and how impervious the feature set remains to future attacks, they will be unable to correctly detect new phishing web pages with evolved content and structure without a regular update of the feature set; (ii) they require substantial manual feature engineering:
Existing phishing detection machine learning models require specialised domain knowledge in order to ascertain the features suitable to each task (e.g., number of white spaces in the HTML content, number of redirects and iframes, etc.). This is a tedious process, and these handcrafted features are often targeted and bypassed in future attacks. It is also challenging to know the best features for one particular application.

To address the above issues, we propose HTMLPhish, a deep learning based, data-driven, end-to-end automatic phishing web page classification approach. Specifically, HTMLPhish uses both the character and word embedding techniques to represent the features of each HTML document. Then Convolutional Neural Networks (CNNs) are employed to model the semantic dependencies. The following characteristics highlight the relevance of HTMLPhish to web page phishing detection:

(1) HTMLPhish analyses HTML directly to help preserve useful information. It also removes the arduous task required for the manual feature engineering process.

(2) HTMLPhish takes into consideration all the elements of an HTML document, such as text, hyperlinks, images, tables, and lists, when training the deep neural network model.

We experimentally demonstrate the significance of character and word embedding features of HTML contents in detecting phishing web pages. We then propose a state-of-the-art HTML phishing detection model, in which the character and word embedding matrices are concatenated before employing convolutions on the represented features. Our proposed approach ensures an adequate embedding of new feature vectors that enables straightforward extrapolation of the trained model to test data. Subsequently, we conduct extensive evaluations on a dataset of over 50,000 HTML documents collected over two months.
This ensures our evaluation settings reproduce real-world situations in which models are trained on data generated up to the present point and applied to new data.

We summarise the main contributions of this paper as follows:

• Different from existing methods, our proposed model, HTMLPhish, to the best of our knowledge, is the first to use only the raw content of the HTML document of a web page to train a deep neural network model for phishing detection. Manual feature engineering is reduced as HTMLPhish learns the representation in the features of the HTML document, and we do not depend on any other complicated or specialist features for the task. Our proposed approach takes advantage of the word and character embedding matrices to present a phishing detection model that automatically accommodates new features and is therefore easily applied to test data.

• We conduct extensive evaluations on a dataset of more than 50,000 HTML documents collected in two months. The distribution of the instances in our dataset is similar to the ratio of phishing and legitimate web pages found in the real world. This ensures that our evaluation metrics and results are relevant to existing systems.

• Furthermore, we carried out a longitudinal study on the efficiency of HTMLPhish to infer the maximum retraining period for which the accuracy of the system does not reduce. Our result recorded only a minimal 4% decrease in accuracy on the test data. This confirms that HTMLPhish remains reliable and temporally robust over a long period.

We organised the remainder of the paper as follows: the next section provides an overview of related works on proposed techniques of detecting phishing on web pages. Section 3 gives the prior knowledge on Convolutional Neural Networks, and Section 4 provides an in-depth description of our proposed model. Section 5 elaborates on the dataset collection, while the detailed results on the evaluations of our proposed model are found in Section 6.
Finally, we conclude our paper in Section 7.
RELATED WORK
In this section, we address the two topics most closely related to our work: phishing web page detection using feature engineering, and the Deep Learning method (especially for NLP).
Feature Engineering for Phishing Web Page Detection
These techniques extract specific features from a web page such as JavaScript, HTML web page, URL, and network features. These are fed into machine learning algorithms to build a classification model. These machine learning techniques differ in the type of heuristics, the number of feature sets used, and the optimisation algorithm applied to the machine learning algorithm. These techniques are based on the fact that phishing and benign web pages have a different content distribution of extracted features. The accuracy of heuristics and machine learning-based techniques critically depends on the type of features extracted and the machine learning algorithm applied. Many phishing detection techniques have been built on different proposed feature sets.

Varshney et al. [11] proposed LPD, a client-side based web page phishing detection mechanism. The strings from the URL and page title of a specified web page are extracted and searched on the Google search engine. If there is a match between the domain names of the top T search results and the domain name of the specified URL, the web page is considered to be legitimate. The result from their evaluations gave a true positive rate of 99.5%.

Smadi et al. [12] proposed a neural network model that can adapt to the dynamic nature of phishing emails using reinforcement learning. The proposed model can handle zero-day phishing attacks and also mitigate the problem of a limited dataset using an updated offline database. Their experiment yielded a high accuracy of 98.63% on fifty features extracted from a dataset of 12,266 emails.

The selection of features from various web page elements can be an expensive process from a security risk and technological workload angle. For example, it can be prolonged and somewhat problematic to extract specific feature sets. Besides, it needs specialist domain expertise to define which features are essential.

Deep Learning
Due to its performance in many applications, Deep Learning has attracted increased interest in recent years [13–15]. The core concept is to learn the feature representation from unprocessed data instantaneously without any manual feature engineering. Under this premise, we want to use Deep Learning to detect phishing HTML content by directly learning how features from the raw HTML string are represented instead of using specialist features that are manually engineered.

As we want to train our Deep Learning networks using textual features, it is therefore essential to discuss NLP as it relates to Deep Learning. Deep learning techniques have been successful in many NLP tasks, for example, in document classification [16], machine translation [17], etc. Recurrent neural networks (e.g., LSTM [18]) have been extensively applied due to their ability to exhibit temporal behaviour and capture sequential data. However, CNNs have become excellent substitutes for LSTMs, especially showing excellent performance in text classification and sentiment analysis, as a CNN learns to recognise patterns across space [19].

Very few attempts have been made to use Deep Learning to detect phishing web pages using web page components. Bahnsen et al. [20] proposed a phishing classification scheme that used features of the URLs of a web page as input and implemented the model on an LSTM network. The results yielded an accuracy of 98.7% on a corpus of 2 million phishing and legitimate URLs. The authors of [21] proposed a CNN based model which combines the outputs of two Convolutional layers to detect malicious URLs.

However, our review did not find any existing approach that detects malicious phishing web pages using only HTML documents with Deep Learning. HTMLPhish learns the semantic information present only in the characters and words in an HTML document to determine the maliciousness of the web page. Our thorough analysis shows that phishing web pages can be detected using only their HTML document content.
We define the problem of detecting phishing web pages using their HTML content as a binary classification task for the prediction of two classes: legitimate or phishing. Given a dataset with T HTML documents {(html_1, y_1), ..., (html_T, y_T)}, html_t for t = 1, ..., T represents an HTML document, while y_t ∈ {0, 1} is its label: y_t = 1 for a phishing page and y_t = 0 for a legitimate one.

Deep Neural Network for Phishing HTML Document Detection
The deep neural network that underlies HTMLPhish is a Convolutional Neural Network (CNN). To detail a basic CNN for HTML document classification: an HTML document is comprised of a string of characters or words. Our goal is to obtain an embedding matrix html → S ∈ R^(maxlen × d), in such a way that S is made up of sets of adjoining inputs s_i, i ∈ (1, 2, ..., maxlen), in a string, in which the input can be individual characters or words from the HTML document. Each input is subsequently transformed into an embedding s_i ∈ R^d, which is the i-th column of S, and the d-dimension is the vector size, which is automatically initialised and learnt together with the remainder of the model. In this paper, the embedding matrix was automatically initialised, and, for parallelisation, all sequences were padded to the same length maxlen.

The CNN performs a convolution operation ⊗ over S ∈ R^(maxlen × d) using

c_i = f(M ⊗ s_{i:i+n−1} + b_i)

followed by a non-linear activation f, where b_i is the bias, M is the convolving filter, and n is the kernel size of the convolution operation. After the convolution, a pooling step is applied (which in our model is Max Pooling) in order to decrease the feature dimension and retain the most important features.

The CNN is capable of exploiting the temporal relation of n-sized segments in its input by using the filter M to convolve on each segment of kernel size n. A CNN model typically contains several sets of filters with different kernel sizes (n); these are model hyperparameters that are set by the user. In this deep neural network, the convolution layer is usually followed by a Pooling layer. The features from the Pooling layer are then passed to dense layers to perform the required classification. The entire network is then trained by using backpropagation.

Note:
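To make the operation above concrete, the convolution and max-pooling steps can be sketched in plain NumPy. This is a toy illustration with made-up dimensions (maxlen = 10, d = 4, one filter of kernel size n = 3); the paper's model uses d = 100 and 32 filters of kernel size 8.

```python
import numpy as np

def conv1d(S, M, b, f=lambda x: np.maximum(x, 0.0)):
    """Slide filter M (n x d) over embedding matrix S (maxlen x d):
    c_i = f(M ⊗ s_{i:i+n-1} + b), with f a ReLU non-linearity."""
    n = M.shape[0]
    return np.array([f(np.sum(M * S[i:i + n]) + b)
                     for i in range(S.shape[0] - n + 1)])

def max_pool(c, pool=2):
    """Keep the maximum activation in each non-overlapping window."""
    return np.array([c[i:i + pool].max()
                     for i in range(0, len(c) - pool + 1, pool)])

# toy dimensions: maxlen = 10 inputs embedded in d = 4 dimensions
rng = np.random.default_rng(0)
S = rng.normal(size=(10, 4))   # embedding matrix S
M = rng.normal(size=(3, 4))    # one convolving filter, kernel size n = 3
c = conv1d(S, M, b=0.1)        # feature map, length maxlen - n + 1 = 8
p = max_pool(c)                # pooled features, length 4
print(c.shape, p.shape)        # (8,) (4,)
```

The feature map is shorter than the input by n − 1 positions, and pooling halves it again; stacking 32 such filters yields the 32 feature maps used by the model.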
In order to differentiate our state-of-the-art model from the baseline models, for the rest of this paper we will use the term HTMLPhish-Full to indicate HTMLPhish trained with the proposed model unless otherwise stated, while HTMLPhish-Character and HTMLPhish-Word represent the deep neural network models using only the character and word embeddings, respectively.
In this section, we elaborate on the architecture of our proposed deep neural network model, HTMLPhish-Full. The network architecture in Figure 3 shows that HTMLPhish-Full has two input layers. The first input layer processes the raw HTML document into an embedding matrix made up of character-level feature representations, while the second input layer does the same with words. These two branches are concatenated in a dense layer called the Concatenation layer. Therefore, the embedding matrix in this model is the concatenation of the character-level embedding matrix and the word embedding matrix, C_em + W_em, where C_em → c ∈ R^(maxlen × d) and W_em → w ∈ R^(maxlen × d). The features in the Concatenation layer allow the preservation of the original information in the HTML content. In the Concatenation layer, the contents of both embedding layers are put alongside each other to yield a 3-dimensional layer [C_em + W_em → (None, 180, 100) + (None, 2000, 100) = (None, 2180, 100)].

To generate the character-level embedding matrix C_em, the model learns an embedding which takes the characteristics of the characters in an HTML document. To do so, all the distinct characters in the corpus, including punctuation marks, are listed. We obtained 167 unique characters. We set the length of the sequences to maxlen = 180 characters. Every HTML document longer than 180 characters is cut from the 180th character, and any HTML document shorter than 180 characters is padded up to 180 with zeroes. Before each character in our work is embedded into a d-dimensional vector, we conduct a tokenization on the characters in the HTML document and segment the characters into tokens, as shown in Figure 1. An index is associated with each token before being applied to a d-dimensional character embedding vector, where d is set at 100, which is automatically initialised and learnt together with the remainder of the model. To facilitate its implementation, each HTML document html is transformed into a matrix html → c ∈ R^(maxlen × d), where d = 100 and maxlen = 180.

For the word embedding matrix W_em, firstly, the raw HTML document is processed into word-level representations by the word embedding layer. To achieve this, all the different words in the HTML documents of the training corpus are listed using the following approach: an HTML document is split into individual words while treating all punctuation characters as separate tokens. For example, as shown in Figure 1, <!DOCTYPE html> will be split into ['<', '!', 'DOCTYPE', 'html']. We surmise that punctuation marks provide important information benefits for phishing HTML document detection, since punctuation marks are more prevalent and useful in the context of HTML documents than in ordinary languages. HTML contains a sequence of markup tags that are used to frame the elements on a website. Tags contain keywords and punctuation marks that define the formatting and display of the content on the Web browser. The listed unique words are used to create a dictionary where every word becomes a feature. We obtained about 321,009 unique words in our dataset.
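The splitting just described can be sketched in a few lines of Python. The helpers `word_tokens` and `char_tokens` and the regular expression are hypothetical illustrations; the paper does not publish its exact splitting rule.

```python
import re

def word_tokens(html):
    """Split an HTML string into words, treating punctuation marks
    (which carry markup information) as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", html)

def char_tokens(html, maxlen=180):
    """Character-level tokenization, truncated to the first maxlen characters."""
    return list(html[:maxlen])

doc = "<!DOCTYPE html>"
print(word_tokens(doc))   # ['<', '!', 'DOCTYPE', 'html', '>']

# map each distinct token to an integer index (0 reserved for padding),
# mirroring the dictionary of unique words built from the training corpus
vocab = {tok: i + 1 for i, tok in enumerate(dict.fromkeys(word_tokens(doc)))}
seq = [vocab[t] for t in word_tokens(doc)]
print(seq)                # [1, 2, 3, 4, 5]
```

The integer sequences produced this way are what the embedding layers consume.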
We also padded the HTML documents to make the lengths of the HTML documents uniform in terms of number of words (maxlen = 2000). Each unique word is then embedded into a d-dimensional vector, where d is set at 100, which is automatically initialised and learnt together with the remainder of the model. All the HTML documents are converted to their respective matrix representations (maxlen × d), on which the CNN is applied, where d = 100 and maxlen = 2000. Figure 1 shows an overview of the character and word embedding layer.
Figure 1: Configuration of the Embedding Layer

Table 1: HTML Documents Used in this Paper

Dataset               D1                D2
Date generated        11-18 Nov, 2018   10-17 Jan, 2019
Legitimate Web Pages  23,000            24,000
Phishing Web Pages    2,300             2,400
Total                 25,300            26,400

We can now introduce the Convolutional layers using the HTML document matrix (for all the HTML documents s_t, ∀t = 1, ..., T) as the corpus. We applied 32 Convolutional filters M ∈ R^(d × n), where n = 8. The Max-Pooling layer, whose features are then passed to a 10-unit dense layer, comes after the Convolutional filters. The dense layer, which is regularised by dropout, finally connects to a Sigmoid layer. Then, using the ADAM optimisation algorithm [22], we train the model through backpropagation.

Baseline Models

The baseline models, HTMLPhish-Character and HTMLPhish-Word, whose architectures are detailed in Figure 3, are CNN models trained either on character-level embeddings or word-level embeddings, respectively. The embedding matrices described above are applied to 32 Convolutional filters M ∈ R^(d × n), where n = 8. The next layer after the Convolutional filters is the Max-Pooling layer, whose features are then passed to a 10-unit dense layer. The dense layer, which also is regularised by dropout, finally connects to a Sigmoid layer. Also, the models are trained through backpropagation using the ADAM optimisation algorithm.

Data collection plays an essential role in phishing web page detection. In our approach, we collated HTML documents using a web crawler. We used the Beautiful Soup [23] library in Python to create a parser that dynamically extracted the HTML document from each final landing page. We chose to use Beautiful Soup for the following reasons:

(1) it has functional versatility and speed in parsing HTML contents, and
(2) Beautiful Soup does not correct errors when analysing the HTML Document Object Model (DOM).
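As a rough stand-in for this extraction step, the sketch below pulls tags and text out of a page using only the Python standard library (the paper itself uses Beautiful Soup; `HTMLGrabber` and `extract` are hypothetical names used here for illustration).

```python
from html.parser import HTMLParser

class HTMLGrabber(HTMLParser):
    """Collect the tag names and raw text of a page: a minimal
    stand-in for the Beautiful Soup parser used in the paper."""
    def __init__(self):
        super().__init__()
        self.tags, self.text = [], []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def extract(html):
    p = HTMLGrabber()
    p.feed(html)
    return p.tags, p.text

# in the real crawler, html would be fetched from the final landing page
tags, text = extract("<html><body><a href='x'>Login</a><p>Welcome</p></body></html>")
print(tags)  # ['html', 'body', 'a', 'p']
print(text)  # ['Login', 'Welcome']
```

Unlike this sketch, Beautiful Soup preserves the document's markup verbatim, which matters here since the model consumes the raw HTML string rather than a cleaned DOM.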
The HTML documents in our corpus include all the contents of an HTML document, such as text, hyperlinks, images, tables, lists, etc. Figure 2 shows an overview of the data collection stage.

Data Collection

Since phishing campaigns follow temporal trends in the composition of web pages, the earliest data obtained should always be used for training and the most recent data collected for testing [24]. Different phishing pages created during the same time may probably have the same infrastructure. This could exaggerate an over-trained classification model's predictive output. To ensure our evaluation settings reproduce real-world situations, in which models are trained on data generated up to the present point and applied to new web pages, we collected a dataset of HTML documents from phishing and legitimate web pages over 60 days.

Also, to ensure the deployability of our model to real-world systems, our dataset is required to provide a distribution of phishing to benign web pages obtainable on the Internet in the real world (≈ 1/10) [25, 26], given that, when a balanced dataset (1/1) is used, the results can yield a baseline error [27]. Consequently, our training dataset D1, consisting of HTML documents from 23,000 legitimate URLs and 2,300 phishing URLs, was collected between 11 November 2018 and 18 November 2018. The D1 dataset was used to train and validate the three different variants of our model (HTMLPhish-Character, HTMLPhish-Word, and HTMLPhish-Full). From 10 January 2019 to 17 January 2019, the testing dataset D2, consisting of HTML documents from 24,000 legitimate URLs and 2,400 phishing URLs, was generated. Note that D1 ∩ D2 = ∅. Also, our testing dataset D2 is slightly larger than our training dataset D1. This is because learning with fewer data and having decent tests on broader test data means that the detection technique is generalised.
This ensures that the features and model of classification include specific features from legitimate and phishing web pages, and that the approach can be applied to the vast number of online Web pages. In total, our corpus was made up of 47,000 legitimate HTML documents and 4,700 phishing HTML documents, as shown in Table 1.

The legitimate URLs were drawn from Alexa.com's top 500,000 domains, while the phishing URLs were gathered by continuously monitoring Phishtank.com. The web pages in our dataset were written in different languages; therefore, our model is not limited to only detecting English web pages. We manually sanitised our corpus to ensure no replicas or web pages pointing to empty content. Alexa.com offers a top list of working websites that internet users frequently visit, so it is an excellent source for our aim.

Table 2 details the selected parameters that we found gave the best performance on our dataset, bearing in mind the unavoidable hardware limitations, for our proposed HTMLPhish variants:
a. HTMLPhish-Character
b. HTMLPhish-Word
c. HTMLPhish-Full

Figure 2: A Schematic Overview of the Stages Involved in Our Proposed Model
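Recapping the architecture, the shape bookkeeping of the character branch, the word branch, and the Concatenation layer can be sketched in NumPy. Random tables stand in for the learned embeddings, and the word vocabulary is shrunk from the paper's roughly 321,009 entries to keep the example small; `embed` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)

# learned embedding tables (random stand-ins; d = 100 as in the paper)
char_table = rng.normal(size=(168, 100))    # 167 unique characters + padding index 0
word_table = rng.normal(size=(5000, 100))   # shrunk stand-in for ~321,009 words

def embed(indices, table, maxlen):
    """Pad/truncate an index sequence to maxlen, then look up embeddings."""
    idx = np.zeros(maxlen, dtype=int)        # index 0 acts as zero padding
    clipped = indices[:maxlen]
    idx[:len(clipped)] = clipped
    return table[idx]                        # shape (maxlen, d)

C_em = embed([5, 17, 42], char_table, maxlen=180)     # character branch
W_em = embed([1001, 7, 99], word_table, maxlen=2000)  # word branch

# the Concatenation layer stacks both matrices along the sequence axis:
# (180, 100) + (2000, 100) -> (2180, 100), matching (None, 2180, 100) above
full = np.concatenate([C_em, W_em], axis=0)
print(C_em.shape, W_em.shape, full.shape)  # (180, 100) (2000, 100) (2180, 100)
```

The 32 convolutional filters of kernel size 8 then slide over the concatenated (2180, 100) matrix before pooling and the dense layers.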
Figure 3: The Overall Architecture of HTMLPhish Variants

Table 2: HTMLPhish-Full Deep Neural Network

Layers        Values                               Activation
Embedding     Dimension = 100                      -
Convolution   Filter = 32, Filter Size = 8         ReLU
Max Pooling   Pool Size = 2                        -
Dense1        No. of Neurons = 10, Dropout = 0.5   ReLU
Dense2        No. of Neurons = 1                   Sigmoid
Total Number of Trainable Parameters: 412,388,597

The three CNN models were implemented in Python 3.5 on a TensorFlow backend with a learning rate of 0.0015 in the Adam optimiser [22]. The batch size for training and testing the models was set to 20. All HTMLPhish and baseline experiments were conducted on an HP desktop with an Intel(R) Core CPU, an Nvidia Quadro P600 GPU, and the CUDA 9.0 toolkit installed.

Evaluation Metrics

Because of the severely imbalanced nature of our dataset, we evaluated the performance of our models in terms of the Area Under the ROC Curve (AUC). We also used the receiver operating characteristic (ROC) curve in our evaluation. The ROC curve is a probability curve, while the AUC depicts how well the model can distinguish between the two classes, which for our model are legitimate and phishing. The higher the AUC value, the better the performance of the model. The ROC curve is plotted with the true positive rate (TPR) against the false positive rate (FPR), where TPR = TP / (TP + FN) and FPR = FP / (TN + FP), and TP, FP, TN, and FN stand for the numbers of True Positives, False Positives, True Negatives, and False Negatives, respectively. Additionally, we employed the precision, True Positive Rate, and F-1 score metrics to evaluate the performance of HTMLPhish and the baseline models.
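The TPR, FPR, and precision used above follow directly from the confusion counts; a small self-contained check (toy labels; `confusion_metrics` is a hypothetical helper):

```python
def confusion_metrics(y_true, y_pred):
    """TPR = TP / (TP + FN), FPR = FP / (TN + FP),
    precision = TP / (TP + FP), with 1 = phishing, 0 = legitimate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (tn + fp), tp / (tp + fp)

# toy labels: 3 phishing pages (1) and 5 legitimate pages (0)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
tpr, fpr, precision = confusion_metrics(y_true, y_pred)
print(tpr, fpr, precision)  # 2/3, 0.2, 2/3
```

Sweeping the decision threshold of the sigmoid output and recording (FPR, TPR) pairs at each threshold traces out the ROC curve whose area gives the AUC.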
The True Positive Rate computes the ratio of phishing HTML documents that are detected by the models. In contrast, the precision metric computes the ratio of detected phishing HTML documents that are actual phishes to the total number of detected phishing HTML documents.

Overall Result

To record the performance of HTMLPhish-Full and the baseline models on the D1 dataset, we split the dataset into 80% for training, 10% for validation, and 10% for testing. Also, taking cognizance of how severely imbalanced our data is, we ensured we manually shuffled the datasets before training. The ROC curves of HTMLPhish and its variants are shown in Figure 4. From the results detailed in Table 3, in general, HTMLPhish-Full significantly outperforms the other two variants, HTMLPhish-Character and HTMLPhish-Word. While HTMLPhish-Character and HTMLPhish-Word have similar performances, HTMLPhish-Full takes advantage of the strengths of both and produces consistently better results. Also, HTMLPhish-Full offered a significant jump in AUC over the other variants, while HTMLPhish-Word performs slightly worse amongst the three.

On the D1 dataset, HTMLPhish-Full provided a 98% accuracy and a 2% False Positive Rate. The minimal False Positive Rate indicates the ratio of legitimate web pages which are incorrectly identified as a phish. This is helpful when the model is deployed in real-world scenarios, as users will not be inappropriately blocked from accessing legitimate web pages.

Considering the computational complexity of HTMLPhish-Full, it can be seen that, on a dataset of over 25,000 HTML documents, HTMLPhish-Full can be speedily trained within 7 minutes. Once trained, HTMLPhish-Full can evaluate an HTML document in 1.4 seconds.

Comparison with State-Of-The-Art Techniques

We compared HTMLPhish-Full with the methodology, speed, and performance of existing state-of-the-art models in [20] and [28].
[28] is a Deep Neural Network with multiple layers of CNNs that takes as input word tokens from a URL to determine the maliciousness of the associated web page. On the other hand, [20] takes as input the character sequence of a URL and models its sequential dependencies using Long Short-Term Memory (LSTM) neural networks to classify a URL as phishing or benign. We applied these techniques to the HTML documents in the D1 dataset and also tested them on the D2 dataset.

From the results detailed in Table 3 and Table 4, HTMLPhish-Full provides better precision and recall, and comparable accuracy, against the existing state-of-the-art models. The performance of HTMLPhish-Word and [28] can be attributed to the fact that they are trained on a definite dictionary of words from the training data; therefore, they are unable to obtain useful embeddings for new words in the test data. HTMLPhish-Character and [20] perform better with respect to the AUC metric because the individual character embedding CNN can learn structural patterns in the HTML document and can also obtain feature representations for new words. This makes it easy to apply to the test data. In addition, due to the limited number of characters, the scale of the CNN model using individual character embeddings remains fixed when compared to word-based model sizes.
However, CNN models built with individual character embeddings cannot exploit the structural information available in long sequences in the HTML document. They also disregard word borders and make it challenging to differentiate special characters in the data. Furthermore, CNNs using only character-level embeddings struggle to differentiate information in scenarios where phishing HTML documents try to imitate benign HTML documents through small modifications to one or a few words in the HTML document [29]. This is because the Convolutional filters will likely yield similar outputs from sequences of characters with similar spellings. Therefore, CNNs using only character embeddings are not enough to obtain structural information from the HTML document in detail. That is the reason word embeddings must be taken into account. Consequently, HTMLPhish-Full takes advantage of both word and character embedding matrices to accommodate unseen words in the test data, and therefore yields a better result than the other variants and baseline models.

Table 3: Result of HTMLPhish and Baseline Evaluations on the D1 dataset

Models               Accuracy  Precision  True Positive Rate  F-1 Score  AUC   Training time
HTMLPhish-Full       0.98      0.97       0.98                0.97       0.93  6.75 mins
HTMLPhish-Word       0.94      0.93       0.94                0.93       0.88  10 mins
HTMLPhish-Character  0.95      0.92       0.95                0.94       0.90  3.5 mins
[28]                 0.97      0.96       0.97                0.96       0.93  5.25 mins
[20]                 0.95      0.94       0.95                0.94       0.91  18 mins

Table 4: Result of HTMLPhish and Baseline Evaluations on the D2 dataset

Models               Accuracy  Precision  True Positive Rate  F-1 Score  AUC   Testing time
HTMLPhish-Full       0.93      0.92       0.93                0.91       0.88  9 seconds
HTMLPhish-Word       0.90      0.87       0.91                0.88       0.73  107 seconds
HTMLPhish-Character  0.91      0.89       0.91                0.89       0.77  7 seconds
[28]                 0.91      0.84       0.91                0.87       0.73  15 seconds
[20]                 0.90      0.90       0.92                0.90       0.78  112 seconds

[Figure 4: ROC curves of HTMLPhish and its variants]