Evaluation of Federated Learning in Phishing Email Detection
Chandra Thapa, Jun Wen Tang, Alsharif Abuadbba, Yansong Gao, Seyit Camtepe, Surya Nepal, Mahathir Almashor, Yifeng Zheng
FedEmail: Performance Measurement of Privacy-friendly Phishing Detection Enabled by Federated Learning
Chandra Thapa, [email protected], CSIRO, Australia
Jun Wen Tang, [email protected], CSIRO, Australia; UNSW, Australia
Sharif Abuadbba, [email protected], CSIRO, Australia; Cyber Security CRC
Yansong Gao, [email protected], CSIRO, Australia; Cyber Security CRC
Yifeng Zheng, [email protected], CSIRO, Australia
Seyit A. Camtepe, [email protected], CSIRO, Australia
Surya Nepal, [email protected], CSIRO, Australia
Mahathir Almashor, [email protected], CSIRO, Australia; Cyber Security CRC
ABSTRACT
Artificial intelligence (AI) has been applied in phishing email detection. Typically, it requires rich email data from a collection of sources, and the data usually contain private information that needs to be preserved. So far, AI techniques have focused solely on centralized data training that eventually accesses sensitive raw email data from the collected data repository. Thus, a privacy-friendly AI technique such as federated learning (FL) is a desideratum. FL enables learning over distributed email datasets to protect their privacy without requiring access to them during learning in a distributed computing framework. This work, to the best of our knowledge, is the first to investigate the applicability of training an email anti-phishing model via FL. Building upon a deep neural network model, in particular a Recurrent Convolutional Neural Network for phishing email detection, we comprehensively measure and evaluate the FL-entangled learning performance under various settings, including balanced and imbalanced data distribution among clients, scalability, communication overhead, and transfer learning. Our results positively corroborate comparable performance statistics of FL in phishing email detection to centralized learning. As a trade-off to privacy and distributed learning, FL has a communication overhead of 0.179 GB per global epoch per client. Our measurement-based results find that FL is suitable for practical scenarios, where data size variation, including the ratio of phishing to legitimate email samples, is present among the clients. In all these scenarios, FL shows a similar testing accuracy of around 98%. Besides, we demonstrate the integration of newly joined clients over time in FL via transfer learning to improve the client-level performance. The transfer learning-enabled training results in an improvement of the testing accuracy by up to 2.6% and fast convergence.
INTRODUCTION
Email is the most common means of formal communication. At the same time, it is exploited as a common tool for phishing attacks, where attackers disguise themselves as a trustworthy entity and try to install malware or obtain sensitive information such as login credentials and bank details of the recipient. Based on the phishing and email fraud statistics 2019 [40], phishing accounts for 90% of data breaches, which leads to an average financial loss of $3.86 million. Moreover, phishing attacks cost American businesses half a billion dollars a year [28], and the cost is increasing. Recently, COVID-19 has driven phishing emails up to an unprecedented level, by over 600% [32].
Correspondingly, there are various techniques devised to protect users from phishing attacks. These techniques can be generally divided into two categories, namely traditional methods and artificial intelligence (AI) based methods. The traditional method is verification based, where emails are filtered out by comparison with references, specifically known email formats, relying on either a list of phishing emails (blacklist), a list of legitimate emails (whitelist), or the contents of known phishing emails. However, email formats can be easily manipulated over time by the attackers, which renders the traditional method inefficient. AI-based methods learn to classify an email as phishing or legitimate with a high probability (e.g., 0.99848 [11]). This method is context aware, and it can continuously learn from newly available email data samples and adapt to handle new attack formats/cases efficiently and on time. AI-based methods can be further broadly classified into two parts, namely conventional ML-based and deep learning (DL) based methods. The performance of a conventional ML-based method depends on delicate feature selection (e.g., semantic and syntax) and processing.
Such feature engineering usually requires domain knowledge and trials, so it is time-consuming and laborious, which hinders improvements against evolving threats. Moreover, it is hard to capture the full contextual information of email data. Naive Bayes, Support Vector Machine, and Decision Tree are examples of conventional ML-based methods [8]. On the contrary, a DL-based method feeds the input directly to the system, and it extracts the critical features and the contextual information by itself. This contributes to high efficiency as well as better performance. Convolutional Neural Network [26] and Recurrent Convolutional Neural Network [11] are typical examples of DL-based methods. Although DL-based methods are preferable over other methods considering their performance and automated feature engineering, as a trade-off, they require a considerable amount of email data.
Unfortunately, emails are sensitive to clients, and disclosure to third parties is not preferred. Thus, organisations or companies are reluctant to share their email data for the improvement of the anti-phishing DL model. Even anonymization of the email is problematic because it can be easily circumvented: attackers can exploit various characteristics, e.g., social graphs, to re-identify the victim's entity [14]. As such, it is non-trivial to aggregate emails for centralized analysis. Besides, a recent work [20] emphasizes the strict ethical concerns when accessing and analyzing the emails of 92 organizations even with access permission. Along with access control, strict rules are required to be followed: firstly, the emails are encrypted during fetching; secondly, only authorized employees at the email analyzing agent can access the data (under standard, strict access control policies); thirdly, personally identifying information or sensitive data disclosed to the authorized employees must not be shared with others; and once the model is built, all the encrypted email must be deleted [7].
For any purpose, improperly centralized data management could violate specific rules, such as reusing the data indiscriminately and risk-agnostic data processing [42], required by the General Data Protection Regulation (GDPR) [10] and HIPAA [3]. Therefore, even with the users' permission to use their data for agreed tasks (e.g., DL), handling the email data under a centralized cloud is still risky under this set of privacy regulations. Overall, there is an urgent need to run a DL-based method without accessing the raw email data for anti-phishing purposes.
Recently, collaborative learning among a large number of participants has become popular, where a joint DL model can be trained by harvesting the rich distributed data held by each participant without accessing them. One of the most popular techniques is federated learning (FL) [6, 21, 29]. To the best of our knowledge, the applicability of FL for email anti-phishing has not been explicitly investigated. In this paper, we take the first empirical measurement of email anti-phishing performance by leveraging FL. This work examines the following five research questions, which aim to capture the practical scenarios and challenges.
RQ1 (Distributed email learning) Can FL be applied to learn from distributed email repositories to achieve a comparable model accuracy as the DL anti-phishing models trained on a centralized email repository?
Based on the measurements, FL achieves a comparable performance to centralized learning, for example, 98.852% testing accuracy with two clients. Though centralized learning has slightly higher (e.g., by 0.466%) testing accuracy, it is not privacy friendly. Thus, for email learning with privacy, FL can be a suitable technique in phishing detection. Details are provided in Section 4.1.
RQ2 (Scalability) How would the number of clients affect FL accuracy and convergence?
Based on the measurements, an increase in the number of clients in FL has a slightly negative effect on the convergence of the accuracy curve and its maximum value. However, more clients (more email datasets) can be available in the FL model training for phishing detection. RQ5 addresses some benefits to accuracy. Details are provided in Section 4.2.
RQ3 (Communication overhead) What is the communication overhead resulting from FL?
Based on the measurements, FL has a communication overhead as a trade-off to privacy. We quantify the overhead and find it to be the same, around 0.179 GB, for all cases of up to twenty clients under our setting. Details are provided in Section 4.3.
RQ4 (Imbalanced data setup) Can we learn from various clients who have different sizes of local datasets in FL?
Based on the measurements, FL performs well over imbalanced data distribution, including different phishing to legitimate email ratios. More precisely, the performance is similar despite imbalanced data distribution among clients, thus making FL suitable in these scenarios. Details are in Section 4.4.
RQ5 (Transfer learning) Can we utilize a pre-trained model on a similar dataset for better performance?
This work implements transfer learning (TL) to improve the client-level performance in cases where clients become available over time in the training process. A faster convergence of the accuracy curve and an increase in its maximum value are observed with TL. Details are in Section 4.5.
Centralized learning (CL) is normally performed by aggregating all available datasets (e.g., phishing and legitimate emails) at one central repository. Then it performs centralized machine learning on the aggregated dataset. Refer to Algorithm 1 for more information. During the learning process, a modeler has access to the raw data, which is shared by one or more clients, thus making it unsuitable if the data is private, such as email samples. Besides, in the era of big data and deep learning, it is non-trivial to maintain the required resources, including storage and computation, in CL. Thus, recently, there has been a rise in distributed learning, in particular FL, where the server updates the global model by weighted averaging of the K clients' local models:

W_{t+1} = Σ_{k=1}^{K} (n_k / n) W_t^k,

where n_k is the local dataset size of client k and n = Σ_k n_k.

Figure 1: An overview of federated learning.
Federated learning [29] allows parallel DL training across distributed clients and pushes the computation to the edge devices (i.e., clients). Figure 1 illustrates an overview of FL. There are four exemplified clients with their local email datasets and one coordinating server. Firstly, each client i trains the model on its local email dataset D_i and produces the local model W_t^i at time instance t, for i ∈ {1, 2, 3, 4}. Secondly, all clients upload their local models to the server. Then the server performs the weighted averaging (i.e., aggregation) of the local models and updates the global model W_{t+1}. Finally, the global model is broadcast to all clients, and this completes one round of the FL process. This process continues until the model converges. In FL, the server synchronizes the training process across the clients. Over the entire training process, only the models (i.e., model parameters) are transmitted between the clients and the server. Thus, a client (e.g., a financial institution) does not need to share its raw email data with the server (e.g., coordinated by an email analyzer) during the training process. Thus, the data are always local and confidential, which makes FL a privacy-preserving technique.

Figure 2: An overview of transfer learning.
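The server-side weighted averaging can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation: the nested-list weight layout and the function name are our own assumptions.

```python
def federated_average(client_weights, client_sizes):
    """Aggregate local models: W_{t+1} = sum_k (n_k / n) * W_t^k.

    client_weights: one list of per-layer weight lists per client.
    client_sizes: local dataset size n_k of each client.
    """
    n = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    averaged = []
    for layer in range(num_layers):
        layer_len = len(client_weights[0][layer])
        acc = [0.0] * layer_len
        for weights, n_k in zip(client_weights, client_sizes):
            for i in range(layer_len):
                # Each client contributes in proportion to its data size.
                acc[i] += (n_k / n) * weights[layer][i]
        averaged.append(acc)
    return averaged
```

For instance, averaging two single-layer models [[1.0, 2.0]] and [[3.0, 4.0]] with dataset sizes 1 and 3 yields [[2.5, 3.5]], since the second client carries three quarters of the weight.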
Transfer learning [38] utilizes an ML model pre-trained on a dataset related to the current dataset. It provides faster convergence and good performance on the current dataset due to the transfer of the previous knowledge from the related dataset. Figure 2 illustrates an example overview of transfer learning on a similar dataset. Firstly, at time t, model W_t trains on the email dataset D1 with a performance (e.g., accuracy) of P1. Secondly, at time t', W_t trains on the email dataset D2 and evolves to W_t', which has a performance of P2 on D2 and P3 on D1. Transfer learning focuses on the improvement of P2 on D2 and does not care about P3 on D1. It is applicable both in CL and FL settings, and is useful to train the model even with fewer data samples.

In this work, phishing and legitimate email samples are collected from three popular sources, namely the First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP) [23], Nazario's phishing corpora (Nazario) [33], and the Enron Email Dataset (Enron) [39]. The dataset contains email samples both with and without a header: IWSPA-AP has both types, whereas all email samples in Nazario and Enron have the header accompanied by the body. (The email header precedes the email body and contains the header fields, including To, Subject, Received, Content-Type, Return-Path, and Authentication-Results.) Overall, the data sources include Wikileaks archives, SpamAssassin, IT departments of different universities, synthetic emails created by Data engine [35], Enron (emails generated by employees of the Enron corporation), and Nazario (a personal collection). To provide more insight into the email samples, we present some frequently appearing words in them as follows:

• IWSPA-AP phishing email (a) with header includes account, PayPal, please, eBay, link, security, update, bank, online, and information, and (b) without header includes text, account, email, please, information, click, team, online, and security. IWSPA-AP legitimate email (a) with header includes email, please, new, sent, party, people, Donald, state, and president, and (b) without header includes text, link, national, US, Trump, and democratic.
• Nazario includes important, account, update, please, email, security, PayPal, eBay, bank, access, information, item, click, confirm, and service.
• Enron includes text, plain, subject, please, email, power, image, time, know, this, message, information, and energy.

Figure 3: An overview of the THEMIS model.
We have considered the updated email datasets from these sources to date, e.g., Nazario's phishing corpus 2019. In total, the experimental dataset has 23475 email samples, and Table 1 shows the number of emails extracted from each source.
Table 1: The number of email samples.
Source      Phishing (P)    Legitimate (L)    P+L
IWSPA-AP    1132            9174              10306
Nazario     8890            0                 8890
Enron       0               4279              4279
Total       10022           13453             23475
THEMIS is one of the recent models that has been demonstrated to be highly effective for phishing email detection. It employs a Recurrent Convolutional Neural Network (RCNN) and models emails at multiple levels, including the char-level email header, word-level email header, char-level email body, and word-level email body [11]. This way, it captures the deep underlying semantics of phishing emails efficiently, consequently making THEMIS better than existing DL-based methods that are limited to natural language processing and deep learning [27].
Model Overview: Fig. 3 illustrates a system overview of the THEMIS model. Firstly, THEMIS extracts the char-level and word-level representations of the email header and body, and then an embedding layer converts all these levels into their respective vector representations. Afterward, it feeds each vector representation into an RCNN model [24] and learns a representation for the email header and email body, respectively. The THEMIS RCNN consists of four Bidirectional Long Short-Term Memory (Bi-LSTM) networks that obtain the left and right semantic information of a specific location together with its embedding information from the above four vectors, thus forming something called a triple. Next, these triples are mapped into specified dimensions using a tanh activation function. Longitudinal max pooling is then applied to obtain four different representations, which are paired to form only two representations, for the header and the body. As the email header representation and body representation have varying degrees of impact on phishing detection, an attention mechanism is applied to compute a weighted sum of the two representations, and this produces an ultimate representation of the whole email, which is further processed to produce the classification result. For more details of the THEMIS model, we refer readers to [11]. It is reported that THEMIS can achieve up to 99.848% overall testing accuracy. Considering its high efficacy in phishing detection, we chose this model as the CL baseline across all our FL-based experiment settings.
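The attention step, a softmax-weighted sum of the header and body representations, can be sketched as follows. This is a simplified version with fixed scalar scores; the actual THEMIS attention learns its scores, and the function name is ours.

```python
import math

def attention_combine(header_vec, body_vec, score_header, score_body):
    # Softmax over the two scalar scores gives the attention weights.
    e_h, e_b = math.exp(score_header), math.exp(score_body)
    a_h = e_h / (e_h + e_b)
    a_b = e_b / (e_h + e_b)
    # The weighted sum is the ultimate representation of the whole email.
    return [a_h * h + a_b * b for h, b in zip(header_vec, body_vec)]
```

With equal scores the two parts contribute equally; a higher header score shifts the final representation toward the header vector.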
The email dataset has two types of file formats, viz. text file (.txt) and mbox file (.mbox). Each email is a single text file if the email sample is in the text format. In the mbox format, all messages are concatenated and stored as plain text in a single file; moreover, each message starts with the four characters "From" followed by a space. Both types of email files are firstly parsed into two parts, namely the email header and email body, and then subjected to further processing, including cleaning and tokenization, to produce char-level and word-level sequences (described in the following paragraphs). By considering equal phishing and legitimate email samples from the total dataset (of 23475 data samples), we prepare the experimental dataset of size 20044 (i.e., 2 × 10022), except for the cases with the imbalanced dataset (where the number of phishing and legitimate email samples varies with the case). For example, cases with five clients have a dataset of size 4008 (i.e., around 20044 divided by 5) in each client. For all experiments, the training-to-testing data split ratio is 80:20.
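The deterministic 80:20 split can be sketched with the standard library. The helper name is ours; seed 123 mirrors the fixed random seed used in the experiments, but this is a sketch, not the actual pipeline code.

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=123):
    # Shuffle deterministically, then hold out the first test_ratio share.
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

Applied to the 20044-sample experimental dataset, this yields roughly 16035 training and 4009 testing samples.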
The Header class of the Python email.header module [12] is used to extract the email header, and this separates the header and body parts of the email samples. In the header section, we consider only the Subject and Content-Type fields, which are deemed essential for phishing detection. This separation is done by using the Python regular expression (RE) module [22].
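A minimal sketch of this header/body separation with the standard library's email parser; the sample email and its field values are made up for illustration only.

```python
from email.parser import Parser

# Hypothetical raw email sample (illustrative addresses and text).
raw_email = (
    "From: [email protected]\n"
    "Subject: Verify your account\n"
    "Content-Type: text/plain\n"
    "\n"
    "Please click the link below to update your bank details.\n"
)

msg = Parser().parsestr(raw_email)
# Keep only the Subject and Content-Type header fields.
header_text = " ".join([msg["Subject"], msg["Content-Type"]])
body_text = msg.get_payload()
```

The parser splits at the first blank line, so the body is recovered verbatim while only the two header fields of interest are retained.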
The Python library Beautiful Soup 4 [17] and an HTML parser [36] are used to clean the text information in HTML format. Besides, we use RE for the plain text (both in header and body) cleaning by removing punctuation and non-alphabetic characters. To filter out the stop words from the header and body, we use stopwords from the Python nltk package (nltk.corpus) [5].
To get the char-level and word-level sequences of tokens for both the header and body parts, the Tokenizer class provided by the Keras library [37] is used. Basically, this encodes each character/word as a unique integer, as required by the input format of the embedding layer. Two main functions are used for tokenization: 'fit_on_texts', which updates the internal vocabulary based on a list of texts, and 'texts_to_sequences', which transforms each text in texts to a sequence of integers by considering only words known by the tokenizer. In all our measurements, we keep 50, 100, 150, and 300 as the lengths of the four sequences of tokens, which are the word-level header, char-level header, word-level body, and char-level body, respectively.
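In spirit, the two Keras Tokenizer calls behave like the pure-Python sketch below (our own simplified re-implementation, not Keras code), with zero-padding to the fixed sequence lengths.

```python
import re

def fit_on_texts(texts):
    # Build a 1-based word -> integer vocabulary in order of first appearance.
    vocab = {}
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def texts_to_sequences(texts, vocab, maxlen):
    # Encode only words known to the vocabulary, then pad/truncate to maxlen.
    sequences = []
    for text in texts:
        seq = [vocab[w] for w in re.findall(r"[a-z]+", text.lower()) if w in vocab]
        sequences.append((seq + [0] * maxlen)[:maxlen])
    return sequences
```

Unknown words are simply dropped, matching the behavior described above of considering only words known by the tokenizer.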
For the purpose of performance measurements, we use a High-Performance Computing (HPC) platform that is built on Dell EMC's PowerEdge platform with a Tesla P100-SXM2-16GB GPU. All code is written in Python 3.6.1, and the THEMIS model, which has an RCNN, is implemented using the TensorFlow 2.2.5 [1] and Keras 2.2.5 [44] frameworks. In all measurements, we keep the same random seed, i.e., random.seed(123). We run centralized model training and federated model training under various settings in our experiments, but with the same learning rate of 0.0001 and batch size of 256. Refer to Algorithms 1 and 2 for details on the training steps of CL and FL, respectively.
Algorithm 1: Centralized learning

Input: Email dataset (n = 20044 samples)
Output: Model performance (Accuracy, F1-score, Precision, and Recall)

/* Runs once at the beginning of the training */
Email dataset preparation:
  Data extraction: Extract and clean the text header and body from the raw phishing and legitimate email samples;
  Setup the phishing to legitimate email data size ratio (e.g., 50:50);
  Train/test split: Separate the cleaned body and header data into 80% training and 20% test datasets, respectively;
  Tokenization: Convert the training and testing datasets into char-level and word-level sequences of the body and header, respectively;
Initialize THEMIS model W_t;
/* Training/testing THEMIS model W_t for a global epoch E on the total email dataset with n samples */
for e ∈ E do
  for b ∈ B do
    Train and test W_t;  // batch size B = 256
    Evaluate training/testing performance;
  end
end

To ease the presentation of the measurement results on the experimental setup (Section 3), we divide this section into five parts according to the five research questions. Each part addresses one research question, providing the corresponding results and conclusion/discussion.
RQ 1. Can FL be applied to learn from distributed email repositories to achieve a comparable model accuracy as the DL anti-phishing models trained on a centralized email repository?
Figure 4 depicts the testing and training accuracy convergence curves over the global epochs for CL and FL with two, five, and ten clients. For the observation window of 45 global epochs, the figure demonstrates a training accuracy of 99.838% in CL and 99.794% in FL (with two clients). Besides, the maximum testing accuracy is 99.351% in CL and 98.852% in FL with two clients. (The accuracy in the THEMIS paper [11] is 99.848%, which is slightly higher than the accuracy in our experiment. This can be due to various reasons, including the email data samples, sample size, and model hyper-parameters. We consider email samples with and without headers, but the THEMIS paper studies only those with the header. Besides, our dataset is up to date with 23475 samples, whereas the THEMIS paper has 8780 samples.)
Algorithm 2: Federated learning

Input: Email dataset
Output: Model performance (Accuracy, F1-score, Precision, and Recall)

/* Server-side */
Server: Initialize and send the global THEMIS model W_t to all K clients;
/* Executes for a global epoch E */
for e ∈ E do
  for each client k ∈ {1, 2, ..., K} in parallel do
    W_t^k ← ClientUpdate(W_t^k);  // local updates
  end
  Perform weighted averaging and update the global model: W_{t+1} ← Σ_{k=1}^{K} (n_k/n) W_t^k;
  Send the updated global model W_{t+1} to all clients;
end
/* Client-side at each client k */
ClientUpdate(W_t^k):
  /* Runs once at the beginning of the training */
  Email dataset preparation:
    Data extraction: Extract and clean the text header and body from the raw phishing and legitimate email samples;
    Setup the phishing to legitimate email data size ratio (e.g., 50:50);
    Train/test split: Separate the cleaned body and header data into 80% training and 20% test datasets, respectively;
    Tokenization: Convert the training and testing datasets into char-level and word-level sequences of the body and header, respectively;
  /* Runs repetitively during the training/testing */
  while the global THEMIS model W_t is received from the server do
    Set W_t^k = W_t;
    /* Training/testing on the local email dataset having n_k samples */
    for b ∈ B do
      Train the THEMIS model W_t^k;  // batch size B = 256
      Evaluate training/testing performance;
    end
    Send the locally trained W_t^k to the server;
  end

Thus, the answer to RQ1 is affirmative. Besides privacy, FL is computationally more efficient than CL since the computation (ML training/testing) is distributed among the clients. In this setup, we reasonably assume that the email organisations (clients) have sufficient computational resources to jointly train the FL model to preserve the privacy of email samples.
RQ 2. How would the number of clients affect FL accuracy and convergence?
Figure 4 illustrates the results from the measurements with two, five, and ten clients in FL. It shows maximum training accuracies of 99.794%, 99.576%, and 98.666% in FL with two, five, and ten clients, respectively. For the testing set, the maximum accuracy is 98.852%, 97.731%, and 96.85% for FL with two, five, and ten clients, respectively. It is not surprising that the convergence is slower and the performance gradually degrades with the increase in the number of clients. For more measurement results, refer to Figure 11 in Appendix A.1. It is expected that the number of clients/organisations participating in anti-phishing is usually limited, e.g., ten; correspondingly, the performance degradation resulting from an increasing number of participants is thus constrained.
RQ 3. What is the communication overhead resulting from FL?
Herein, we quantify the communication overhead in FL with two, five, ten, and twenty clients. For the overhead, we measure the data uploaded (i.e., the sum of the data packet sizes of W_t^k and n_k) and downloaded (i.e., the data packet size of W_{t+1}) to and from the server, respectively, averaged over the total number of participating clients. In CL, we do not consider a client-server setup; thus, the communication overhead is zero. On the other hand, FL has communication overhead as a trade-off to preserve data privacy. For example, in the two-client setup, the average total communication size per global epoch per client is 0.178794768 GB and 0.178794749 GB during upload to the server and download from the server, respectively. Figure 5 illustrates the results. It shows that the communication overhead per client per global epoch is the same for all cases of up to twenty clients. The consistent communication overhead over multiple clients makes FL suitable for distributed training with a large number of clients. Moreover, the overhead can be easily addressed by a well-connected setup with wired or wireless connections between the server and the clients participating in the anti-phishing framework. There is a slight difference between the upload size and the download size because at each client in FL, the download is always the weights of the model, whereas the upload is the weights of the model plus the size of the local dataset (which is required to carry out weighted model averaging at the server, based on Algorithm 2).

Figure 4: Convergence curves of (a) testing accuracy and (b) training accuracy in centralized and federated learning (FL) with two, five, and ten clients.
Figure 5: Average data communication in centralized and federated learning (FL) with various numbers of clients.
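The measured per-direction size is consistent with a model of roughly 48 million 32-bit parameters; that parameter count is our back-of-the-envelope estimate, not a figure reported by the measurements.

```python
def model_size_gb(num_params, bytes_per_param=4):
    # The full model weights travel once per direction per global epoch,
    # so per-round overhead per client is about twice this value.
    return num_params * bytes_per_param / (1024 ** 3)
```

For example, model_size_gb(48_000_000) is about 0.179 GB, matching the measured upload and download sizes; the overhead is independent of the number of clients because every client exchanges the same full model.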
RQ 4. Can we learn from various participants who have different sizes of local datasets in FL?
So far, we have considered equal data distribution across the clients. Now, in this measurement, we examine the performance of FL under an imbalanced data setup. In this regard, we consider two, five, and ten clients, and variation in the local data sizes based on the maximum percentage of variation, given by the term "var". For example, if var = 10, the local data size variation across five clients is [−10%, −5%, 0%, +5%, +10%], where −10% refers to 10% less local data and +10% refers to 10% more data in the respective clients. This means 3606, 3806, 4008, 4208, and 4408 local data samples reside in clients 1, 2, 3, 4, and 5, respectively, if var = 10.

Figure 6: Testing accuracy curves showing the impact of different local data sizes, provided by different var among clients, on their convergence in FL with five clients.
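The per-client sizes induced by var can be sketched as below. The helper is our own, and its rounding may differ by a sample or two from the sizes quoted above.

```python
def client_sizes(total_samples, var, num_clients=5):
    # Spread variation percentages evenly in [-var, +var] across clients,
    # e.g., var=10 with five clients -> [-10, -5, 0, +5, +10] percent.
    if num_clients == 1:
        percents = [0.0]
    else:
        step = 2 * var / (num_clients - 1)
        percents = [-var + i * step for i in range(num_clients)]
    base = total_samples // num_clients
    return [round(base * (1 + p / 100)) for p in percents]
```

With var = 0 every client holds the same 4008-sample share of the 20044-sample dataset; larger var skews the shares while keeping the total roughly constant, which is what the weighted averaging in FL compensates for.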
The results of our measurements for 10%, 20%, 50%, and 80% variations in the sizes of the local dataset in FL among five clients are depicted in Figure 6, which shows that the convergence of the test accuracy curves fluctuates slightly until the global epoch of 10, then remains stable afterwards. All cases with different var maintain an overall testing accuracy of around 98%. A similar trend persists in the training phase. Refer to Figures 18 and 19 in Appendix A.3 for more results. Besides, the convergence trend is also similar for the cases with two and ten clients. For details, refer to Figures 12, 13, and 14 in Appendix A.2, and Figures 22, 23, and 24 in Appendix A.4. The similarity in performance despite variations in the local data sizes amongst clients indicates FL's resilience (mostly enabled by weighted averaging) to data size variations.
Figure 7 depicts the results of our measurements for FL among five clients having the same size of the local dataset but with (i) 10:90 (first case), (ii) 30:70 (second case), and (iii) 70:30 (third case) phishing to legitimate email sample (P/L) ratios. We chose these specific ratios for the test so that the P/L ratio remains distinct. This setup is more practical than the setup with the same P/L ratio, as the latter has a bias in the samples. The measurements in this section have var = 0. The figure shows that until the global epoch of 15, there is a difference in the performance, where the first case (i.e., the 10:90 P/L ratio), with the fewest phishing email samples, does not perform well compared with the other cases with more phishing email samples. However, after that epoch, all cases converge similarly to provide an overall testing accuracy of around 98% (refer to Figure 21 in Appendix A.3 for more results). For the cases with two and ten clients, the results follow a similar pattern as observed in the case with five clients. However, the performance of the case with the 10:90 P/L ratio jumps at different global epochs; it jumps after 5 and 27 global epochs in the cases with two clients and ten clients, respectively. For details, refer to Figures 15, 16, 17, 25, and 26 in Appendices A.2 and A.4.
Figure 7: Testing accuracy curves showing the impact of different legitimate to phishing email sample ratios in the local dataset on their convergence in FL with five clients under various data distribution settings.
Based on the aforementioned results, the answer to RQ4 is affirmative, and FL demonstrates a similar overall performance despite variations in the data distribution across clients.
RQ 5. Can we utilize a pre-trained model on a similar dataset for better performance?
In this section, we perform three experiments to demonstrate the effects and benefits of transfer learning in phishing detection in a distributed setup, which correspondingly answer the above research question (RQ5).

Experiment 1. In this experiment, we consider five clients in total, where the first four clients (C1–C4) participate in FL for the first 15 global epochs and train the model collaboratively. Afterward, transfer learning is carried out only with the fifth client, and the training proceeds for the next 15 global epochs (i.e., until 30 global epochs). In other words, the model is trained only by the fifth client for the last 15 epochs. For the performance evaluation, the testing results are computed for all five clients throughout the process. This experiment examines how a newly joining client can perform FL to improve its phishing detection performance compared to simply using the pre-trained model (on a similar dataset) for detection. The experimental result depicted in Figure 8 is for the case with var = 80, which varies the sizes of the local datasets (i.e., [-80%, -40%, 0%, +40%, +80%]) to capture a practical setting among the five clients. The figure shows that the average test accuracy of the first four clients is slightly higher than that of the fifth client (which did not participate in the learning process) until 15 global epochs. Afterward, the average testing accuracy of the fifth client improves by 2.6% over the others, since its training dataset now trains the model. This improvement decreases with lesser variation in the sizes of the local datasets; for example, it is only 0.84% with var = 0. For more details, refer to Figures 27, 29, and 30 in Appendix A.5. The overall results show that the evolved model (after training by client 5) is still relevant to the first four clients (C1–C4), as their average testing results with and without client 5 differ only nominally. Nonetheless, the fifth client boosts the accuracy of phishing detection on its local dataset by performing transfer learning under the FL setup.
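The two-phase schedule of Experiment 1 can be sketched as follows. This is a minimal illustration with a logistic-regression model and synthetic data standing in for the paper's RCNN and email corpus; all names here are ours, not the authors' implementation.

```python
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One epoch of logistic-regression gradient descent on a client's
    local data (a stand-in for the paper's RCNN training)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return w - lr * X.T @ (p - y) / len(y)

def fed_avg(weights, sizes):
    """Size-weighted average of client models (FedAvg aggregation)."""
    return np.average(np.stack(weights), axis=0, weights=np.asarray(sizes, float))

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 4)), rng.integers(0, 2, 50)) for _ in range(5)]
w = np.zeros(4)

# Phase 1: clients C1–C4 train collaboratively for 15 global epochs.
for _ in range(15):
    locals_ = [local_step(w.copy(), X, y) for X, y in clients[:4]]
    w = fed_avg(locals_, [len(y) for _, y in clients[:4]])

# Phase 2 (transfer learning): only the newly joined client C5
# fine-tunes the pre-trained global model on its own data.
for _ in range(15):
    w = local_step(w, *clients[4])
```

The key point mirrored from the experiment is that client 5 never contributes gradients in phase 1; it only inherits the aggregated model and refines it locally in phase 2.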
Experiment 2. In this experiment, the learning process starts with the first client, and then one new client joins at an interval of 10 global epochs as the training proceeds. Refer to Table 2 for details. This experiment simulates the practical case where more than one client (different from Experiment 1) becomes available over time during model training, and demonstrates how newly available clients can continue to perform FL to contribute accuracy improvements for phishing detection.

Figure 8: Convergence curves of (a) testing accuracy and (b) training accuracy from Experiment 1 with five clients and var = 80. The first four clients train the model until 15 global epochs, and then (only) the fifth client trains the model.

Table 2: Experimental steps for Experiment 2 (columns: Round, Involvement of clients).

Figure 9: Convergence curves of (a) testing accuracy and (b) training accuracy from Experiment 2 with five clients and var = 0. The FL training starts with one client, i.e., client 1, and a new client joins the training at every 10th global epoch, in sequence from client 2 to client 5.

Clients are gradually added to the learning process, as stated in Table 2. As expected, Figure 9 shows that the testing accuracy improves for each client when it is added to FL via transfer learning. For example, the average testing accuracy jumps by around 13% for client 2 when it joins client 1 in training the model at global epoch 10. The testing performance is computed for all clients; however, the training result is recorded only when a client is involved in the model training. Thus, the accuracy before a client joins the training is zero in Figure 9(b). The performance pattern is similar for the case with var = 80 (refer to Figure 31 in Appendix A.6).
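The joining schedule of Experiment 2 reduces to a small helper. `active_clients` below is a hypothetical function (not from the paper's code) that returns which clients train at a given global epoch:

```python
def active_clients(global_epoch: int, join_interval: int = 10, n_clients: int = 5):
    """Clients participating at a given global epoch in Experiment 2:
    client 1 starts alone, and one more client joins every
    `join_interval` epochs until all clients are training."""
    joined = 1 + global_epoch // join_interval
    return list(range(1, min(joined, n_clients) + 1))

print(active_clients(0))   # → [1]
print(active_clients(10))  # → [1, 2]
print(active_clients(45))  # → [1, 2, 3, 4, 5]
```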
Experiment 3. In this experiment, we analyze the performance of client 1, which we assume to be a newly participating client, with and without transfer learning. For this, we consider five clients with a variation in their datasets given by var = 80 (this means that client 1 has 80% fewer data samples than client 3). In the transfer learning setup, the model is first trained by the four clients (client 2 to client 5) for 15 global epochs, and then the resulting model (the pre-trained model) is further trained by client 1 on its local email data samples. On the other hand, for the case without transfer learning, client 1 performs CL on its local email dataset.
Figure 10: Convergence curves of (a) testing accuracy and (b) training accuracy of client 1 in transfer learning and centralized learning.
The experimental result is depicted in Figure 10. It shows that transfer learning outperforms CL, along with faster convergence, for client 1. Moreover, at global epoch 45, the testing accuracy in transfer learning is 1.87% higher than in CL. For more results, refer to Figure 32 in Appendix A.7. Based on the results from the above three experiments, it is clear that transfer learning is useful for boosting phishing detection performance under the federated setup.
RELATED WORK

Centralized, AI-based email analysis for phishing detection has been explored for a long time. Conventional ML-based techniques such as decision trees, logistic regression, random forests, AdaBoost, and support vector machines have been analyzed for phishing detection [2, 4, 18, 45–47]. These techniques are based on feature engineering, which requires in-depth domain knowledge and trials. On the other hand, DL-based methods include deep neural networks [43], convolutional neural networks (CNNs) [26], deep belief networks [50], bidirectional LSTM with supervised attention [34], and recurrent convolutional neural networks [11]. These works are mostly based on natural language processing techniques for phishing detection. While most existing works have focused on the effective detection of general phishing emails, a few works consider specialised phishing attacks, including spear phishing attacks [15] and business email compromise attacks [7], in specific contexts. Despite their usefulness, all the above works operate under a setting where emails must be centralized for analysis and thus do not provide privacy protection for email datasets.
There have been attempts at cryptographic approaches for supporting DL model training over encrypted data, which apply to the training of deep neural network models for phishing email detection while preserving privacy. In [31], Mohassel and Zhang propose SecureML, the first system design for privacy-preserving neural network training. In their system, multiple data providers secretly share their data among two cloud servers, which then conduct the training procedure over the secret-shared data. They rely on secure computation techniques, e.g., secret sharing and garbled circuits, to design a secure two-party computation protocol that allows the two cloud servers to compute, in the ciphertext domain, the linear operations (addition and multiplication) as well as the non-linear activation functions. Later, Wagh et al. [48] propose a design that works in the three-server model and is purely based on the lightweight secret sharing technique, with better performance than SecureML. This work assumes an adversary model where none of the three cloud servers will deviate from the protocol. The work in [30] also operates under a similar three-server setting, yet achieves more robust security against malicious adversaries who deviate arbitrarily. This line of work presents valuable research endeavours in enabling deep neural network training over encrypted data. Yet, it has to rely on additional architectural assumptions (i.e., non-colluding cloud servers) and also incurs substantial performance overheads (up to orders of magnitude slower) when compared to the plaintext baseline.
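The additive secret sharing at the core of this line of work can be illustrated in a few lines: each input is split into shares that are individually uniformly random, and the servers compute linear operations share-wise without ever seeing the plaintext. This toy sketch is ours; real systems such as SecureML additionally need Beaver triples or garbled circuits for multiplications and non-linear activations.

```python
import secrets

P = 2**61 - 1  # prime modulus for the toy field

def share(x):
    """Split x into two additive shares; each share alone is uniform."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def reconstruct(s0, s1):
    """Recombine the two shares to recover the secret."""
    return (s0 + s1) % P

# Two data providers secret-share their inputs between two servers.
a0, a1 = share(1234)
b0, b1 = share(5678)

# Each server adds its local shares; neither server sees 1234 or 5678.
c0, c1 = (a0 + b0) % P, (a1 + b1) % P
assert reconstruct(c0, c1) == 1234 + 5678
```

Security here rests on the non-collusion assumption noted above: either share in isolation reveals nothing, but a server holding both could reconstruct the input.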
FL is attractive especially when the data is sensitive, as in the financial sector (banks) and the medical sector (hospitals) [41]. There have been several works on FL, though none of them specifically addresses phishing email detection. Some works include the following: Google has used FL for next-word prediction in a virtual keyboard for smartphones [19], Leroy et al. applied FL for speech keyword spotting [25], Gao et al. [13] propose to use FL to train a joint model over heterogeneous ECG medical data to preserve the data privacy of each party, and Yang et al. [49] applied FL to detect credit card fraud.
Various other techniques, such as homomorphic encryption [16] (a cryptographic approach) and differential privacy [9], are used along with FL for guaranteed and provable privacy, respectively. However, homomorphic encryption increases the computational overhead, and differential privacy degrades the performance as a trade-off. The integration of these techniques with FL in phishing detection remains future work.
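As a rough illustration of how differential privacy could be layered onto FL (left as future work above), each client would clip its model update and add Gaussian noise before upload. The function below is a hypothetical sketch in the style of DP-FedAvg mechanisms, and a real deployment must also track the cumulative privacy budget.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_mult=1.1, rng=None):
    """Clip an update to L2 norm `clip_norm`, then add Gaussian noise
    scaled by `noise_mult` (illustrative DP-style perturbation only)."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_mult * clip_norm, size=update.shape)
    return clipped + noise

# A client perturbs its raw update before sending it to the server.
noisy = privatize_update(np.ones(8) * 5.0, rng=np.random.default_rng(1))
```

The performance trade-off mentioned above comes directly from the two knobs here: tighter clipping and larger noise multipliers give stronger privacy but noisier aggregated models.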
CONCLUSION

This work took the first step toward implementing federated learning (FL) for privacy-preserving phishing email detection. Built upon a state-of-the-art deep learning model delicately designed for phishing email detection as a centralized learning baseline, our comprehensive measurements under FL demonstrated promising results while preserving the privacy of the email content. More specifically, the deep learning model performance under FL was as good as that of centralized learning under various practical scenarios, including imbalanced data distribution among clients. Besides, this work leveraged transfer learning to enable fast convergence of accuracy curves and improved accuracy in client-level phishing detection.

In FL, the email data samples always reside with the email data custodians, e.g., organizations from different geographic locations participating in phishing detection, and data is not shared among the participants. Considering this inherent privacy-preservation feature, FL potentially unleashes the willingness of more clients (and thus more data) to contribute to deep learning for phishing email detection, improving the model performance, including the client-level performance, by harnessing data integration as illustrated in this paper.
ACKNOWLEDGMENTS

This work is partially supported by the Cyber Security Cooperative Research Centre, Australia. The authors acknowledge Professor Rakesh Verma from the University of Houston, USA, for the IWSPA-AP corpus.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, and Andy Davis ... Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning.
In Proc. of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, Vol. 269. ACM, 60–69.
[3] HIPAA Compliance Assistance. 2003. Summary of the HIPAA privacy rule. Office for Civil Rights (2003).
[4] André Bergholz, Jan De Beer, Sebastian Glahn, Marie-Francine Moens, Gerhard Paaß, and Siehyun Strobel. 2010. New filtering approaches for phishing email. Journal of Computer Security.
CoRR abs/1902.01046 (2019). http://arxiv.org/abs/1902.01046
[7] Asaf Cidon, Lior Gavish, Itay Bleier, Nadia Korshun, Marco Schweighauser, and Alexey Tsitkin. 2019. High precision detection of business email compromise. In Proc. 28th USENIX Security Symposium. 1291–1307.
[8] Emmanuel Gbenga Dada, Joseph Stephen Bassi, Haruna Chiroma, Shafi'i Muhammad Abdulhamid, Adebayo Olusola Adetunmbi, and Opeyemi Emmanuel Ajibuwa. 2019. Machine learning for email spam filtering: review, approaches and open research problems. Heliyon.
Foundations and Trends in Theoretical Computer Science.
IEEE Access.
World Wide Web 21, 6 (2018), 1759–1771.
[15] Hugo Gascon, Steffen Ullrich, Benjamin Stritter, and Konrad Rieck. 2018. Reading Between the Lines: Content-Agnostic Detection of Spear-Phishing Emails. In Proc. of RAID, Michael Bailey, Thorsten Holz, Manolis Stamatogiannakis, and Sotiris Ioannidis (Eds.). 69–91.
[16] Craig Gentry. 2009. A fully homomorphic encryption scheme.
IEEE Trans. Dependable Secur. Comput. 15, 6 (2018), 988–1001.
[19] Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. 2018. Federated learning for mobile keyboard prediction. https://arxiv.org/abs/1811.03604
[20] Grant Ho, Asaf Cidon, Lior Gavish, Marco Schweighauser, Vern Paxson, Stefan Savage, Geoffrey M Voelker, and David Wagner. 2019. Detecting and characterizing lateral phishing at scale. In USENIX Security Symposium (USENIX Security 19). 1273–1290.
[21] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. 2019. Advances and open problems in federated learning. https://arxiv.org/abs/1912.04977
[22] Secret Labs. [n.d.]. re – Regular expression operations. https://docs.python.org/3/library/re.html Accessed: Jan 18, 2020.
[23] ReDAS Lab@UH. [n.d.]. First Security and Privacy Analytics Anti-Phishing Shared Task (IWSPA-AP 2018). https://dasavisha.github.io/IWSPA-sharedtask/ Accessed: Jan 16, 2020.
[24] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[25] David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau. 2019. Federated learning for keyword spotting. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6341–6345.
[26] Hiransha M, Nidhin Unnithan, Vinayakumar R, and Soman Kp. 2018. Deep Learning Based Phishing E-mail Detection CEN-Deepspam. In Proc. 1st AntiPhishing Shared Pilot at 4th ACM IWSPA.
[27] Hiransha M, Nidhin A Unnithan, Vinayakumar R, and Soman KP. 2018. Deep Learning Based Phishing E-mail Detection. In Proc. 1st AntiPhishing Shared Pilot at 4th ACM IWSPA.
Journal of Machine Learning Research.
Proceedings of the 2018 ACM SIGSAC Conf. on Computer and Communications Security, CCS. 35–52.
[31] Payman Mohassel and Yupeng Zhang. 2017. SecureML: A System for Scalable Privacy-Preserving Machine Learning. In Proc. of IEEE S&P.
Proc. 1st AntiPhishing Shared Pilot at 4th ACM IWSPA.
IEEE Transactions on Knowledge and Data Engineering.
Proceedings of the 11th USENIX Conference on Hot Topics in Cloud Computing (Renton, WA, USA) (HotCloud'19). USENIX Association, Berkeley, CA, USA, 1–1. http://dl.acm.org/citation.cfm?id=3357034.3357036
[43] Sami Smadi, Nauman Aslam, and Li Zhang. 2018. Detection of online phishing email using dynamic evolving neural network based on reinforcement learning. Decis. Support Syst. 107 (2018), 88–102.
[44] Keras team. [n.d.]. Keras: The Python deep learning library. https://keras.io/ Accessed: Jan 18, 2020.
[45] Nidhin Unnithan, Harikrishnan NB, Vinayakumar R, Soman Kp, and Sai Sundarakrishna. 2018. Detecting Phishing E-mail using Machine Learning Techniques CEN-SecureNLP. In Proc. 1st AntiPhishing Shared Pilot at 4th ACM IWSPA.
[46] Anu Vazhayil, NB Harikrishnan, R Vinayakumar, KP Soman, and ADR Verma. 2018. PED-ML: Phishing email detection using classical machine learning techniques. In Proc. 1st AntiPhishing Shared Pilot at 4th ACM IWSPA. 1–8.
[47] Rakesh M. Verma, Narasimha Shashidhar, and Nabil Hossain. 2012. Detecting Phishing Emails the Natural Language Way. In Proc. of ESORICS, Vol. 7459. 824–841.
[48] Sameer Wagh, Divya Gupta, and Nishanth Chandran. 2019. SecureNN: 3-Party Secure Computation for Neural Network Training. PoPETs.
International Conference on Big Data. Springer, 18–32.
[50] Jiahua Zhang and Xiaoyong Li. 2017. Phishing detection method based on borderline-smote deep belief network. In Proc. of International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage. 45–53.
A SUPPLEMENTAL RESULTS

A.1 Distributed learning and scalability (additional result)

Figure 11: Testing results providing the precision, recall, and F1-score at the global epoch of 45 in centralized and federated learning (FL) with two, five, and ten clients.
A.2 Imbalanced data setup - two clients (additional cases)

Figure 12: Testing accuracy curves showing the impact of different local data sizes provided by different var among clients on their convergence in federated learning with two clients.
Figure 13: Training accuracy curves showing the impact of different local data sizes provided by different var among clients on their convergence in federated learning with two clients.
Figure 14: Testing results providing the precision, recall, and F1-score at the global epoch of 45 in federated learning with two clients under various data distribution settings created by different var.

Figure 15: Testing accuracy curves showing the impact of different legit email to phishing email sample ratios in the local dataset on their convergence in federated learning with two clients under various data distribution settings.
Figure 16: Training accuracy curves showing the impact of different legit email to phishing email sample ratios in the local dataset on their convergence in federated learning with two clients under various data distribution settings.
Figure 17: Testing results providing the precision, recall, and F1-score at the global epoch of 45 in federated learning with two clients under various phishing to legitimate email sample ratios (P:L) in the local dataset.
A.3 Imbalanced data setup - five clients (additional results)

Figure 18: Training accuracy curves showing the impact of different local data sizes provided by different var among clients on their convergence in federated learning (FL) with five clients.
Figure 19: Testing results providing the precision, recall, and F1-score at the global epoch 45 in federated learning with five clients under various data distribution settings created by different var.

Figure 20: Training accuracy curves showing the impact of different legit email to phishing email sample ratios in the local dataset in federated learning with five clients under various data distribution settings.
Figure 21: Testing results providing the precision, recall, and F1-score at the global epoch of 45 in federated learning with five clients under different phishing to legitimate email sample ratios (P:L) in the local dataset.
A.4 Imbalanced data setup - ten clients (additional cases)

Figure 22: Testing accuracy curves showing the impact of different local data sizes provided by different var among clients on their convergence in federated learning with ten clients.
Figure 23: Training accuracy curves showing the impact of different local data sizes provided by different var among clients on their convergence in federated learning with ten clients.
Figure 24: Testing results providing the precision, recall, and F1-score at the global epoch of 45 in federated learning with ten clients under various data distribution settings created by different var.

Figure 25: (a) Testing accuracy and (b) training accuracy curves showing the impact of different legit email to phishing email sample ratios in the local dataset on their convergence in federated learning with ten clients under various data distribution settings.
Figure 26: Testing results providing the precision, recall, and F1-score at the global epoch of 45 in federated learning with ten clients under various phishing to legitimate email sample ratios (P:L) in the local dataset.
A.5 Transfer learning - Experiment 1 (additional cases)
Figure 27: Convergence curves of testing accuracy from Experiment 1 with five clients and var = . The first four clients train the model until 15 global epochs, and then (only) the fifth client trains the model.

Figure 28: Convergence curves of training accuracy from Experiment 1 with five clients and var = . The first four clients train the model until 15 global epochs, and then (only) the fifth client trains the model.

Figure 29: Convergence curves of (a) testing accuracy and (b) training accuracy from Experiment 1 with five clients and var = . The first four clients train the model until 15 global epochs, and then (only) the fifth client trains the model.

Figure 30: Convergence curves of (a) testing accuracy and (b) training accuracy from Experiment 1 with five clients and var = . The first four clients train the model until 15 global epochs, and then (only) the fifth client trains the model.

A.6 Transfer learning - Experiment 2 (additional case)

Figure 31: Convergence curves of (a) testing accuracy and (b) training accuracy from Experiment 2 with five clients and var = 80. In federated learning, the training starts with one client, i.e., client 1, and a new client joins the training at every 10th global epoch, in sequence from client 2 to client 5.

A.7 Transfer learning - Experiment 3 (additional result)