DeepQuarantine for Suspicious Mail

Abstract

In this paper, we introduce DeepQuarantine (DQ), a cloud technology to detect and quarantine potential spam messages. Spam attacks are becoming more diverse and can potentially be harmful to email users. Despite the high quality and performance of spam filtering systems, detection of a spam campaign can take some time. Unfortunately, in this case some unwanted messages get delivered to users. To solve this problem, we created DQ, which detects potential spam and keeps it in a special Quarantine folder for a while. The time gained allows us to double-check the messages to improve the reliability of the anti-spam solution. Due to high precision of the technology, most of the quarantined mail is spam, which allows clients to use email without delay. Our solution is based on applying Convolutional Neural Networks on MIME headers to extract deep features from large-scale historical data. We evaluated the proposed method on real-world data and showed that DQ enhances the quality of spam detection.



Nikita Benkovich, Roman Dedenok, and Dmitry Golubev

Kaspersky, Moscow 125212, Russia {Nikita.Benkovich, Roman.Dedenok, Dmitry.S.Golubev}@kaspersky.com


Keywords: spam filtering · spam detection · machine learning · deep learning · cloud technology.

Nowadays it is hard to imagine life without e-mail communication, particularly in business. The growth in e-mail's popularity is explained by the low cost and high effectiveness of exchanging messages. The same factors contribute to the increasing amount of spam. According to a report by Kaspersky [14], the average percentage of spam in global mail traffic in Q1-Q2 2019 was 57.64%, up 1.67 p.p. compared to the previous reporting period. The largest share of spam was recorded in May (58 [...]

[...] decisions must be very reliable, which obviously reduces the detection rate. To solve this problem, commercial anti-spam products delay potential spam messages and recheck them after a certain time to improve the reliability of the anti-spam solution. Axway Inc. [5] described a delay technique in an e-mail filtering system, which provides a store and a transmission path for quarantined data. This mechanism has proven reliable, and its various modifications are now used by many companies such as Cisco, Barracuda and others.

In this paper, we describe a novel approach to quarantining messages. Our work focuses on applying Deep Learning [13] techniques to MIME (Multipurpose Internet Mail Extensions) [12] headers to classify potential spam. Unlike most research papers, our solution does not process the body content of a message. The proposed architecture has three inputs: a character sequence of the Message-ID, a sequence of headers and the X-Mailer. To extract information from sequential data, we use a one-dimensional convolutional neural network (CNN). This method has been applied to characters for text classification [16]. It has been shown that this approach can be competitive with traditional solutions, for example with a simple long short-term memory network (LSTM) [8]. Moreover, unlike an LSTM, CNNs do not depend on the computations of previous states. This affects model performance, which is extremely important in real-time services.

We evaluated our approach on a large-scale dataset. In the experiments, we showed that the combination of our proposed model and traditional spam filters improves classification rates.

Cybercriminals continue to look for new ways to spread spam and to improve previous techniques. Traditional signature approaches are becoming less effective than in previous years. The reasons are poor generalization ability and the need for human resources to find new attacks and develop signatures to block them.

Machine learning techniques have recently become very effective in fighting spam. Most research papers propose different methods to handle the body content of a message. [2] suggested a defence strategy against poisoning attacks, in which spammers enrich messages with legitimate words to defeat filters. They showed that bagging ensembles could be very promising in this task. In [4] the authors applied deep learning and transfer learning techniques to detect different attacks such as phishing, social engineering, propaganda and others. [6] demonstrated a phishing content classifier based on a recurrent neural network.

There are also related works that use non-content features for spam detection. [15] noted that message headers are a powerful source of features for spam filtering. The experiments showed that using only features from headers could achieve comparable or better performance than using body content. In [11] and [9] the authors proposed hand-crafted methods to extract features from e-mail headers, and evaluated the performance of various machine learning classifiers using a prepared corpus.

Publicly available benchmark datasets on e-mail spam highlighted in [1] are not regularly updated and thus do not reflect actual threats. Publication of real e-mail collections is almost impossible since this data is associated with numerous confidentiality and legal restrictions. Moreover, available datasets can be highly biased because they contain conversations between a small group of users. For example, the popular Enron corpus [10] is deemed to be in the public domain as the result of an investigation after the company's collapse and contains only communications between Enron employees. These factors complicate research in this area and the adaptation of the proposed methods to the real world.

In this section, we introduce the design of DQ. We describe three main parts of the new technology. First, we focus on the backend logic, which is responsible for message transactions and the system-customer relationship. Then we illustrate the preprocessing of message headers. Finally, we show the design of the model for spam classification.

According to Figure 1a, messages in the origin-based scheme are processed by a complicated system of spam filters before being delivered to the user. Moreover, spam filters are regularly updated because the statistical properties of spam campaigns change over time. This approach can be used and potentially provides a high detection rate. In real life, most missed spam is detected shortly after the filters are updated. Unfortunately, the considered scheme still delivers these messages to users, because the spam decision is made only once, when a message is received.

To solve this problem we implemented DQ, illustrated in Figure 1b. DQ is a cloud technology that provides request-response logic with an anti-spam service installed on users' machines. The main objective of DQ is the classification of potential spam. After a message has passed through the filters, the service sends a request to DQ with the message headers and waits for a response. Meanwhile, DQ handles the input data and returns true if the message should be delayed or false otherwise. Of course, in real life, organizing this communication to process the big data that accumulates from different user nodes is not a trivial task. We do not go deep into implementation details and focus on the logic of the technology. As shown in Figure 1b, suspicious mail is put in the Quarantine folder for a while, and the rest is delivered to the user. When the time is over, quarantined messages pass through the filters again. It should be noted that DQ only receives the required headers and returns the quarantine decision; all delayed mail in the Quarantine folder is located on the user's PC.

The proposed scheme gains time to update filters and double-check suspicious messages to improve the reliability of the anti-spam solution. Moreover, this implementation provides a low-cost way to update the model, which is extremely important for adapting to new spam tactics or changing mail transfer protocols.
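The client-side logic of the DQ scheme can be illustrated with a minimal sketch. It is not the production implementation: the names `Message`, `Mailbox`, `handle_incoming` and `release_quarantine` are hypothetical, the local filters are stood in for by an `is_spam` callable, and the cloud request is stood in for by a `should_delay` callable that sees only the headers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    headers: Dict[str, str]
    body: str = ""


@dataclass
class Mailbox:
    inbox: List[Message] = field(default_factory=list)
    spam: List[Message] = field(default_factory=list)
    quarantine: List[Message] = field(default_factory=list)


def handle_incoming(msg: Message, box: Mailbox,
                    is_spam: Callable[[Message], bool],
                    should_delay: Callable[[Dict[str, str]], bool]) -> None:
    """First pass: local filters, then the delay decision based on headers only."""
    if is_spam(msg):
        box.spam.append(msg)
    elif should_delay(msg.headers):  # only headers are sent; the mail stays local
        box.quarantine.append(msg)
    else:
        box.inbox.append(msg)


def release_quarantine(box: Mailbox, is_spam: Callable[[Message], bool]) -> None:
    """Second pass, run when the hold time is over: re-check with the current filters."""
    held, box.quarantine = box.quarantine, []
    for msg in held:
        (box.spam if is_spam(msg) else box.inbox).append(msg)
```

In this sketch a message that the filters miss on arrival but that is held in quarantine ends up in the spam folder if the filters have been updated by the time `release_quarantine` runs; otherwise it is delivered.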

Fig. 1. Backend of spam-filtering systems: (a) origin-based, (b) DQ implementation

It is well known that feature selection plays a big role in model performance. An e-mail provides a large amount of information about the sender and the content of the message. Some of this data can be useless and add unwanted noise that lowers model performance. Our solution is based on non-content classification. Because of this, we are able to transfer data to the cloud service and collect this type of information more simply than in the content-based case. Another important aspect is the ability to quickly extract features from a message. Since DQ is a real-time service, performance is very important to ensure e-mail communication without delay.

At the moment, the model takes the Message-ID, a sequence of message headers (HeaderSeq) and the X-Mailer. To bypass protection systems and spread malicious mail, spammers often use their own Mail User Agent (MUA). MUAs are responsible for preparing e-mail messages for transfer to a Mail Transfer Agent (MTA). One of the MUA's tasks is to create and correctly fill in MIME headers. Some attackers ignore this and use random content for headers. Others try to fake headers to make them look like real ones. We focus on Message-ID and HeaderSeq for several reasons. Firstly, these features have a non-trivial structure. Secondly, the form of the Message-ID and the order of headers in HeaderSeq can vary depending on the type of MUA, which creates a tight connection between the features. These facts make forgery more difficult, which helps the model to detect spam. We also added the X-Mailer to identify the MUA. Below we describe the features and their representation for the classifier.

The Message-ID provides an identifier for messages and looks like a sequence of US-ASCII characters between an angle bracket pair. For example: < [...] >

The Message-ID consists of two parts split by @. The left part of the Message-ID is a hash that has a specific structure for different MUAs. The right part is a domain. The Message-ID is transformed into a tensor of size l, where each row vector is a character embedding. For encoding, we build a vocabulary that maps US-ASCII characters (without special characters) to trainable embeddings. In addition, we added two symbols: <EOS> for the end of a string and <UNK> for unknown characters. If the length of the Message-ID is greater than l, the first l characters are taken. If the length of the Message-ID is less than l, the sequence is padded with <EOS> to the length l.

The HeaderSeq is a sequence of MIME headers in the message. The order of headers can vary depending on the type of MUA. The encoding of HeaderSeq follows the same scheme as the Message-ID. The only difference is that we operate on header names, not characters. For example, subject:from:to:date:message-id:content-type: is a possible HeaderSeq. The final representation is a tensor with a fixed shape, where each row is an encoded header. The number of rows was estimated from statistics as the 95th percentile of HeaderSeq length.
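The truncate-or-pad encoding described above can be sketched as follows. This is an illustration under stated assumptions, not the vocabulary used in DQ: the character set in `CHAR_VOCAB`, the header names in `HEADER_VOCAB`, the stripping of the angle brackets and all function names are hypothetical. The sketch produces integer indices only; the lookup of trainable embeddings for those indices is left out.

```python
import string

EOS, UNK = "<EOS>", "<UNK>"


def build_vocab(tokens):
    """Map each token to an integer index; indices 0 and 1 are reserved for EOS and UNK."""
    vocab = {EOS: 0, UNK: 1}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, length):
    """Truncate to `length` tokens, map unknown tokens to UNK, pad the rest with EOS."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens[:length]]
    return ids + [vocab[EOS]] * (length - len(ids))


# Assumed character set: letters, digits and a few separators.
CHAR_VOCAB = build_vocab(string.ascii_letters + string.digits + "@.-_$")
# Assumed header vocabulary; a real one would be built from historical data.
HEADER_VOCAB = build_vocab(
    ["subject", "from", "to", "date", "message-id", "content-type"])


def encode_message_id(message_id, length):
    # Assumption: the enclosing angle brackets carry no information and are dropped.
    return encode(list(message_id.strip("<>")), CHAR_VOCAB, length)


def encode_header_seq(header_seq, length):
    names = [h for h in header_seq.lower().split(":") if h]
    return encode(names, HEADER_VOCAB, length)
```

For instance, `encode_header_seq("subject:from:to:date:message-id:content-type:", 8)` yields six header indices followed by two `<EOS>` indices, and a header name outside the vocabulary is mapped to the `<UNK>` index.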

The X-Mailer is the name of a MUA. Before encoding, we preprocess the X-Mailer to keep only the type of MUA. For an actual e-mail program, we drop information about version and release. For example, Microsoft Windows Live Mail 14.0.8117.416 is transformed to Microsoft. This helps to significantly reduce the size of the feature space. We also conducted experiments that used the name and version of the MUA, but this did not increase performance. For an unknown e-mail program, we created a special category. The encoding is done using one-hot encoding.

In this section, we describe the architecture of the spam classifier. Although DQ does not block messages, we cannot delay all of them for re-checking, because this would significantly increase e-mail delivery time. Moreover, to ensure that e-mail works without delays, DQ has a time limit for the response. If the time is over, the message is delivered to the user without applying DQ. For these reasons, we have a trade-off between model complexity and computation time.
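The response time limit can be illustrated with a small sketch in which a missing answer is treated as "do not delay". The function name `delay_decision`, the `query_dq` callable standing in for the cloud request, and the thread-pool mechanism are assumptions made for the example; the paper does not describe how the limit is enforced.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for outgoing requests (size chosen arbitrarily for the sketch).
_POOL = ThreadPoolExecutor(max_workers=4)


def delay_decision(headers, query_dq, time_limit_s):
    """Return True if the message should be quarantined, False if it should be delivered.

    If `query_dq` does not answer within `time_limit_s` seconds, the message is
    delivered without applying the quarantine decision.
    """
    future = _POOL.submit(query_dq, headers)
    try:
        return bool(future.result(timeout=time_limit_s))
    except FutureTimeout:
        future.cancel()  # best effort; a request that is already running is not interrupted
        return False
```

The design choice here is that the fallback always favours delivery, which matches the requirement that legitimate mail must not be held up by a slow response.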


Figure 2 demonstrates the model architecture. Following [16], for Message-ID and HeaderSeq we applied a temporal CNN to extract features from sequential data. This kind of CNN applies convolutional filters along one dimension and captures all units from the others. We also used the one-dimensional version of the max-pooling module applied in [3].
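To make the two building blocks concrete, the following is a minimal pure-Python sketch of a temporal convolution and a temporal max-pooling operation. It assumes stride 1, no padding, non-overlapping pooling windows and a ReLU applied inside the convolution; none of these hyperparameters are taken from the paper, and a real model would use a deep learning framework instead of nested lists.

```python
def temporal_conv(seq, filters, biases):
    """1-D convolution over time followed by ReLU.

    seq:     list of T vectors, each with C channels (e.g. embeddings)
    filters: list of F kernels, each a list of K rows with C weights
    biases:  list of F bias values
    Returns a list of T - K + 1 vectors with F channels each.
    """
    width = len(filters[0])
    out = []
    for t in range(len(seq) - width + 1):
        window = seq[t:t + width]
        row = []
        for kernel, bias in zip(filters, biases):
            # The kernel slides along time only and covers every channel of the window.
            total = bias + sum(w * x
                               for k_row, x_row in zip(kernel, window)
                               for w, x in zip(k_row, x_row))
            row.append(max(0.0, total))  # ReLU
        out.append(row)
    return out


def temporal_max_pool(seq, size):
    """Non-overlapping max over `size` consecutive time steps, per channel."""
    return [[max(col) for col in zip(*seq[t:t + size])]
            for t in range(0, len(seq) - size + 1, size)]
```

Each kernel spans all channels of a window and moves along the time axis only, which is what distinguishes a temporal convolution from the two-dimensional convolutions used for images.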

Fig. 2. Architecture of spam classifier

For the Message-ID we designed a subnet with four temporal convolution layers with a fixed number of filters in each. We applied ReLU as the activation function. The first layer has the biggest filter size, to extract information from longer subsequences. After the first and last layers, we inserted a temporal max-pooling layer to ensure stability of training. In the HeaderSeq branch, we used two layers: a temporal convolution layer and a temporal max-pooling layer. The shallow architecture is the result of the small length of HeaderSeq. The outputs from the convolutional nets are concatenated with the encoded X-Mailer into a one-dimensional tensor, as illustrated in Figure 2. Finally, we added two fully connected layers and inserted dropout [7] between them for regularization. We used a sigmoid activation to obtain the probability of spam as the model's output.
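The final stage of the classifier, in which the branch outputs are concatenated with the one-hot X-Mailer and passed through two fully connected layers, can be sketched at inference time as follows. The category list `KNOWN_MUAS`, the ReLU on the hidden layer, the layer sizes and all function names are assumptions made for illustration. Dropout is omitted because it is only active during training.

```python
import math

# Hypothetical MUA categories; the last one-hot slot is reserved for unknown programs.
KNOWN_MUAS = ["Microsoft", "Apple Mail", "Thunderbird"]


def one_hot_mua(name):
    vec = [0.0] * (len(KNOWN_MUAS) + 1)
    idx = KNOWN_MUAS.index(name) if name in KNOWN_MUAS else len(KNOWN_MUAS)
    vec[idx] = 1.0
    return vec


def flatten(feature_map):
    """Turn a list of per-time-step vectors into one flat vector."""
    return [v for row in feature_map for v in row]


def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [b + sum(w * v for w, v in zip(row, x))
            for row, b in zip(weights, biases)]


def spam_probability(msgid_feats, header_feats, mua_name, w1, b1, w2, b2):
    # Concatenate both convolutional branch outputs with the encoded X-Mailer.
    x = flatten(msgid_feats) + flatten(header_feats) + one_hot_mua(mua_name)
    hidden = [max(0.0, h) for h in dense(x, w1, b1)]  # assumed ReLU on the hidden layer
    logit = dense(hidden, w2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid
```

The weights `w1`, `b1`, `w2` and `b2` would come from training; here they are plain nested lists so that the data flow through the concatenation and the two layers is easy to follow.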

Many research papers state that a CNN usually requires large-scale datasets to work and to achieve competitive performance in different areas. Unfortunately, public datasets for spam classification are fairly small and do not show actual threats because they are not regularly updated.

In this work, we used a collection that consists of metadata from tens of millions of real-time e-mail scans. We split the data into training and test datasets by timestamp to avoid leaking information from the future into the past. We sampled 120 million objects for training and 40 million objects for testing. In both datasets, the proportion of spam is about 40 percent. We optimized the weights of the model using SGD with momentum of 0. [...]

Fig. 3. Precision-recall curve computed on the test dataset
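An operating point on a precision-recall curve such as the one in Figure 3 is obtained by sweeping a probability threshold over the model scores. The sketch below picks the threshold with the highest recall among those that reach a target precision. The function name and the toy inputs are hypothetical, and the sketch assumes binary labels (1 for spam) with at least one positive example.

```python
def threshold_for_precision(scores, labels, target_precision):
    """Return (threshold, precision, recall) or None if the target is unreachable.

    A message is flagged when its score is >= threshold. Thresholds are tried
    from the highest score downwards, so recall never decreases along the
    sweep and the last qualifying threshold has the highest recall.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    best = None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Evaluate only at boundaries between distinct score values.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= target_precision:
            best = (score, precision, recall)
    return best
```

With scores `[0.9, 0.8, 0.7, 0.6, 0.5]` and labels `[1, 1, 0, 1, 0]`, a target precision of 1.0 selects the threshold 0.8, at which two of the three spam messages are caught and no legitimate message is delayed.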

We show the PR curve in Figure 3 to demonstrate the model performance on the test data. As mentioned earlier, a classifier should have high precision to deliver legitimate e-mail messages without delay. We defined a probability threshold for which the precision is equal to 0.998 and the recall is 0. [...]

This article proposes a non-content-based classification approach to delay potential spam messages in real time. On the one hand, we demonstrated a novel feature set and a way to handle it for a spam classification task. On the other hand, we showed that this method is well suited for enterprise solutions because it has a simple update scheme, high performance and a low false positive rate. Furthermore, by combining this technology with resource-intensive checks that require additional time for verification/response (such as Whois requests, in-depth content verification, etc.), we can get a fast and cost-effective system for detecting spam messages.

References

1. Bhowmick, A., Hazarika, S. M.: Machine Learning for E-mail Spam Filtering: Review Techniques and Trends. arXiv preprint arXiv:1606.01042 (2016)
2. Biggio, B., Corona, I., Fumera, G., Giacinto, G., Roli, F.: Bagging Classifiers for Fighting Poisoning Attacks in Adversarial Classification Tasks. In: MCS 2011, LNCS, vol. 6713, pp. 350-359, Springer-Verlag (2011)
3. Boureau, Y.-L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 2559-2566, IEEE, Finland (2010)
4. Dhamani, N., Azunre, P., Gleason, J. L., Corcoran, C., Honke, G., Kramer, S., Morgan, J.: Using Deep Networks and Transfer Learning to Address Disinformation. arXiv preprint arXiv:1905.10412 (2019)
5. Google Patents: Delay technique in e-mail filtering system, https://patents.google.com/patent/US20090157708A1/en. Last accessed 15 Sep 2019
6. Halgas, L., Agrafiotis, I., Nurse, J.: Catching the Phish: Detecting Phishing Attacks using Recurrent Neural Networks (RNNs). In: 20th World Conference on Information Security Applications, Korea (2019)
7. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735-1780 (1997)
9. Hu, Y., Guo, C., Ngai, E., Liu, M., Chen, S.: A Scalable Intelligent Non-content-based Spam-filtering Framework. Expert Systems with Applications 37(12), 8557-8565 (2010)
10. Klimt, B., Yang, Y.: The Enron corpus: A new dataset for email classification research. In: Proceedings of the 15th European Conference on Machine Learning, pp. 217-226, Pisa, Italy (2004)
11. Qaroush, A., Khater, I. M., Washaha, M.: Identifying spam e-mail based-on statistical header features and sender behavior. In: CUBE International Information Technology Conference, pp. 771-778, ACM, USA (2012)
12. RFC 1521: Mechanisms for Specifying and Describing the Format of Internet Message Bodies, https://tools.ietf.org/html/rfc1521. Last accessed 15 Sep 2019
13. Schmidhuber, J.: Deep Learning in Neural Networks: An Overview. Neural Networks 61, 85-117 (2015)
14. Securelist: Spam and phishing in Q2 2019, https://securelist.com/spam-and-phishing-in-q2-2019/92379/