DeepQuarantine for Suspicious Mail
Nikita Benkovich, Roman Dedenok, and Dmitry Golubev
Kaspersky, Moscow 125212, Russia
{Nikita.Benkovich, Roman.Dedenok, Dmitry.S.Golubev}@kaspersky.com

Abstract.
In this paper, we introduce DeepQuarantine (DQ), a cloud technology to detect and quarantine potential spam messages. Spam attacks are becoming more diverse and can potentially be harmful to email users. Despite the high quality and performance of spam filtering systems, detection of a spam campaign can take some time. Unfortunately, in this case some unwanted messages get delivered to users. To solve this problem, we created DQ, which detects potential spam and keeps it in a special Quarantine folder for a while. The time gained allows us to double-check the messages to improve the reliability of the anti-spam solution. Due to the high precision of the technology, most of the quarantined mail is spam, which allows clients to use email without delay. Our solution is based on applying Convolutional Neural Networks to MIME headers to extract deep features from large-scale historical data. We evaluated the proposed method on real-world data and showed that DQ enhances the quality of spam detection.
Keywords: spam filtering · spam detection · machine learning · deep learning · cloud technology.

Nowadays it is hard to imagine life without e-mail communication, particularly in business. The growth in e-mail's popularity is explained by the low cost and high effectiveness of exchanging messages. The same factors contribute to the increasing amount of spam. According to a report by Kaspersky [14], the average percentage of spam in global mail traffic in Q1-Q2 2019 was 57.64%, up 1.67 p.p. compared to the previous reporting period. The largest share of spam was recorded in May (58. [...]

[...] decisions must be very reliable, which obviously reduces the detection rate. To solve this problem, commercial anti-spam products delay potential spam messages and recheck them after a certain time to improve the reliability of the anti-spam solution. Axway Inc. [5] described a delay technique for e-mail filtering systems, which provides a store and a transmission path for quarantined data. This mechanism has proven reliable, and its various modifications are now used by many companies such as Cisco, Barracuda and others.

In this paper, we describe a novel approach to quarantining messages. Our work focuses on applying Deep Learning [13] techniques to MIME (Multipurpose Internet Mail Extensions) [12] headers to classify potential spam. Unlike most research papers, our solution does not process the body content of a message. The proposed architecture has three inputs: a character sequence of the Message-ID, a sequence of headers, and the X-Mailer. To extract information from sequential data, we use a one-dimensional convolutional neural network (CNN). This method has been applied to character-level text classification [16]. It has been shown that this approach can be competitive with traditional solutions, for example with a simple long short-term memory network (LSTM) [8]. Moreover, unlike an LSTM, a CNN does not depend on the computations of previous states. This makes the computation easier to parallelize, which benefits run-time performance and is extremely important in real-time services.

We evaluated our approach on a large-scale dataset. In the experiments, we showed that combining our proposed model with traditional spam filters improves classification rates.
Cybercriminals continue to look for new ways to spread spam and to improve previous techniques. Traditional signature approaches are becoming less effective than in previous years. The reasons are poor generalization ability and the need for human effort to find new attacks and develop signatures to block them.

Machine learning techniques have recently become very effective in fighting spam. Most research papers propose different methods to handle the body content of a message. The authors of [2] suggested a defence strategy against poisoning attacks, in which spammers enrich messages with legitimate words to defeat filters. They showed that bagging ensembles could be very promising for this task. In [4], the authors applied deep learning and transfer learning techniques to detect different attacks such as phishing, social engineering, propaganda and others. The authors of [6] demonstrated a phishing content classifier based on a recurrent neural network.

There are also related works that use non-content features for spam detection. The authors of [15] noted that message headers are a powerful source of features for spam filtering. Their experiments showed that using only features from headers could achieve comparable or better performance than using body content. In [11] and [9], the authors proposed hand-crafted methods to extract features from e-mail headers and evaluated the performance of various machine learning classifiers on a prepared corpus.
Publicly available benchmark datasets on e-mail spam, highlighted in [1], are not regularly updated and thus do not reflect actual threats. Publication of real email collections is almost impossible, since this data is subject to numerous confidentiality and legal restrictions. Moreover, available datasets can be highly biased because they contain conversations between a small group of users. For example, the popular Enron corpus [10] is deemed to be in the public domain as the result of an investigation after the company's collapse and contains only communications between Enron employees. These factors complicate research in this area and the adoption of the proposed methods in the real world.
In this section, we introduce the design of DQ. We describe the three main parts of the new technology. First, we focus on the backend logic, which is responsible for message transactions and the system-customer relationship. Then we illustrate the preprocessing of message headers. Finally, we show the design of the model for spam classification.
According to Figure 1a, messages in the origin-based scheme are processed by a complicated system of spam filters before being delivered to the user. Moreover, spam filters are regularly updated because the statistical properties of spam campaigns change over time. This approach can be used and potentially provides a high detection rate. In real life, most missed spam is detected shortly after the filters are updated. Unfortunately, the considered scheme still delivers these messages to users, because the spam decision is made only once, when a message is received.

To solve this problem we implemented DQ, illustrated in Figure 1b. DQ is a cloud technology that provides request-response logic with the anti-spam service installed on users' machines. The main objective of DQ is the classification of potential spam. After a message has passed through the filters, the service sends a request to DQ with the message headers and waits for a response. Meanwhile, DQ handles the input data and returns true if the message should be delayed or false otherwise. Of course, in real life, organizing this communication to process the large volume of data that accumulates from different user nodes is not a trivial task. We do not go deep into implementation details and focus on the logic of the technology. As shown in Figure 1b, suspicious mail is put in the Quarantine folder for a while, and other mail is delivered to the user. When the time is over, quarantined messages pass through the filters again. It should be noted that DQ only receives the required headers and returns the quarantine decision; all delayed mail in the Quarantine folder is located on the user's PC.

The proposed scheme gains time to update the filters and double-check suspicious messages, which improves the reliability of the anti-spam solution. Moreover, this implementation provides a low-cost way to update the model, which is extremely important for adapting to new spam tactics or changing mail transfer protocols.
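The request-response logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and attribute names (`Message`, `MailClient`, `is_spam`, `dq_should_delay`) are hypothetical, the cloud call is replaced by a plain callable, and the quarantine timer is reduced to an explicit `release_quarantine` call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    headers: Dict[str, str]
    body: str = ""


@dataclass
class MailClient:
    """Hypothetical client-side anti-spam service (all names are illustrative)."""
    is_spam: Callable[[Message], bool]                  # local spam filters
    dq_should_delay: Callable[[Dict[str, str]], bool]   # stand-in for the DQ cloud request
    inbox: List[Message] = field(default_factory=list)
    spam: List[Message] = field(default_factory=list)
    quarantine: List[Message] = field(default_factory=list)

    def receive(self, msg: Message) -> None:
        if self.is_spam(msg):
            self.spam.append(msg)
        elif self.dq_should_delay(msg.headers):  # only headers leave the machine
            self.quarantine.append(msg)          # the message itself stays local
        else:
            self.inbox.append(msg)

    def release_quarantine(self) -> None:
        """Called when the quarantine time is over: pass messages through the filters again."""
        pending, self.quarantine = self.quarantine, []
        for msg in pending:
            (self.spam if self.is_spam(msg) else self.inbox).append(msg)
```

In this sketch a message that the filters miss on arrival can still be caught if the filters have been updated by the time `release_quarantine` runs, which is the benefit the scheme in Figure 1b aims for.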
Fig. 1. Backend of spam-filtering systems: (a) origin-based, (b) DQ implementation
It is well known that feature selection plays a big role in model performance. An e-mail provides a large amount of information about the sender and the content of the message. Some of this data can be useless and add unwanted noise that lowers model performance. Our solution is based on non-content classification. Because of this, we are able to transfer data to the cloud service and collect this type of information more simply than in the content-based case. Another important aspect is the ability to extract features from a message quickly. Since DQ is a real-time service, performance is very important to ensure email communication without delay.

At the moment, the model takes three inputs: the Message-ID, a sequence of message headers (HeaderSeq) and the X-Mailer. To bypass protection systems and spread malicious mail, spammers often use their own Mail User Agent (MUA). MUAs are responsible for preparing email messages for transfer to a Mail Transfer Agent (MTA). One of the MUA's tasks is to create and correctly fill in the MIME headers. Some attackers ignore this and use random content for headers. Others try to fake headers to make them look like real ones. We focus on Message-ID and HeaderSeq for several reasons. Firstly, these features have a non-trivial structure. Secondly, the form of the Message-ID and the order of headers in HeaderSeq can vary depending on the type of MUA, which creates a tight connection between the features. These facts make forgery more difficult, which helps the model to detect spam. We also added the X-Mailer to identify the MUA. Below we describe the features and their representation for the classifier.
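As a rough illustration of how the three inputs named above could be read from a raw message, the following sketch uses Python's standard `email` package. The function name `extract_features` and the sample message are assumptions made for this example; the paper does not describe how the production service parses mail.

```python
from email import message_from_string
from typing import List, Tuple


def extract_features(raw_message: str) -> Tuple[str, List[str], str]:
    """Return (Message-ID, HeaderSeq, X-Mailer) from a raw RFC 822 style message."""
    msg = message_from_string(raw_message)
    message_id = msg.get("Message-ID", "")               # header lookup is case-insensitive
    header_seq = [name.lower() for name in msg.keys()]   # names in order of appearance
    x_mailer = msg.get("X-Mailer", "")                   # empty string if the header is absent
    return message_id, header_seq, x_mailer


# Made-up sample message; the body is never inspected.
RAW = (
    "Subject: Hello\n"
    "From: alice@example.com\n"
    "To: bob@example.org\n"
    "Date: Tue, 1 Jan 2019 00:00:00 +0000\n"
    "Message-ID: <1234.abcd@example.com>\n"
    "X-Mailer: ExampleMailer 1.0\n"
    "\n"
    "Body is ignored.\n"
)
```

Calling `extract_features(RAW)` yields the Message-ID string, the ordered list of lower-cased header names, and the X-Mailer value.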
The Message-ID provides an identifier for a message and looks like a sequence of US-ASCII characters between a pair of angle brackets. For example: <...>
The Message-ID consists of two parts separated by @. The left part of the Message-ID is a hash that has a specific structure for different MUAs. The right part is a domain. The Message-ID is transformed into a tensor with l rows, where each row vector is a character embedding. For encoding, we build a vocabulary that maps US-ASCII characters (without special characters) to trainable embeddings. In addition, we added two symbols: <EOS> for the end of a string and <UNK> for unknown characters. If the length of the Message-ID is greater than l, the first l characters are taken. If the length of the Message-ID is less than l, the sequence is padded with <EOS> to length l.

The HeaderSeq is the sequence of MIME headers in the message. The order of headers can vary depending on the type of MUA. The encoding of HeaderSeq follows the same scheme as for the Message-ID. The only difference is that we operate on header names, not characters. For example, subject:from:to:date:message-id:content-type: is a possible HeaderSeq. The final representation is a tensor with a fixed shape where each row is an encoded header. The number of rows was estimated from statistics as the 95th percentile of HeaderSeq length.
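The truncate-or-pad scheme shared by the Message-ID and HeaderSeq encodings can be sketched as follows. The concrete character set, the header vocabulary, the index assignments and the helper names (`build_vocab`, `encode`, `parse_header_seq`) are assumptions for illustration; the output is a list of integer indices that a trainable embedding layer would then map to row vectors.

```python
import string
from typing import Dict, Iterable, List, Sequence

EOS, UNK = "<EOS>", "<UNK>"


def build_vocab(symbols: Iterable[str]) -> Dict[str, int]:
    """Map each symbol to an integer index; indices 0 and 1 are reserved."""
    vocab = {EOS: 0, UNK: 1}
    for s in symbols:
        vocab.setdefault(s, len(vocab))
    return vocab


def encode(tokens: Sequence[str], vocab: Dict[str, int], length: int) -> List[int]:
    """Keep the first `length` tokens, map unknown ones to <UNK>, pad with <EOS>."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokens[:length]]
    return ids + [vocab[EOS]] * (length - len(ids))


def parse_header_seq(header_seq: str) -> List[str]:
    """Split a 'subject:from:...:' string into header names."""
    return [name for name in header_seq.split(":") if name]


# Assumed character set: ASCII letters, digits and punctuation.
CHAR_VOCAB = build_vocab(string.ascii_letters + string.digits + string.punctuation)
# Assumed (tiny) header vocabulary, taken from the example HeaderSeq above.
HEADER_VOCAB = build_vocab(["subject", "from", "to", "date", "message-id", "content-type"])
```

For instance, `encode("<ab@x>", CHAR_VOCAB, 4)` keeps only the first four characters, while `encode(parse_header_seq("subject:from:"), HEADER_VOCAB, 4)` pads the two known headers with two <EOS> indices.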
The X-Mailer is the name of a MUA. Before encoding, we preprocess the X-Mailer to keep only information about the type of MUA. For an actual e-mail program, we drop the information about version and release. For example, Microsoft Windows Live Mail 14.0.8117.416 is transformed to Microsoft. This helps to significantly reduce the size of the feature space. We also conducted experiments that used the name and version of the MUA, but this did not increase performance. For unknown e-mail programs, we created a special category. The X-Mailer is encoded using one-hot encoding.

In this section, we describe the architecture of the spam classifier. Despite the fact that DQ does not block messages, we cannot delay all of them for re-checking, because this would significantly increase e-mail delivery time. Moreover, to ensure that e-mail works without delays, DQ has a time limit for the response. If the time is over, the message is delivered to the user without applying DQ. For these reasons, we have a trade-off between model complexity and computation time.
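The response time limit described above amounts to a fail-open rule: if no decision arrives in time, the message is delivered. A minimal sketch using Python's standard `concurrent.futures` module is given below. The function name `should_quarantine`, the `classify` callable and the default limit of 50 ms are all assumptions; the paper does not state the actual limit or where the timeout is enforced.

```python
import concurrent.futures
import time
from typing import Callable, Dict

# Shared worker pool for outstanding DQ requests (pool size is an arbitrary choice).
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def should_quarantine(
    classify: Callable[[Dict[str, str]], bool],
    headers: Dict[str, str],
    time_limit_s: float = 0.05,   # assumed value, not taken from the paper
) -> bool:
    """Return the DQ decision, or False (deliver) if it does not arrive in time."""
    future = _executor.submit(classify, headers)
    try:
        return bool(future.result(timeout=time_limit_s))
    except concurrent.futures.TimeoutError:
        future.cancel()           # best effort; a call that is already running is not interrupted
        return False
```

A slow classifier therefore never holds up delivery for longer than `time_limit_s`, at the cost of skipping DQ for that message.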
Figure 2 shows the model architecture. Following [16], for Message-ID and HeaderSeq we applied a temporal CNN to extract features from sequential data. This kind of CNN slides convolutional filters along one dimension (the sequence) while each filter spans all units of the other dimension (the embedding). We also used the one-dimensional version of the max-pooling module applied in [3].
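To make the two operations concrete, here is a small pure-Python sketch of a temporal (one-dimensional) convolution and temporal max-pooling over a sequence of embedding rows. It is written for readability rather than speed, uses "valid" padding with stride 1 for the convolution and non-overlapping windows for pooling, and all of these choices are assumptions, since the paper does not list them.

```python
from typing import List

Matrix = List[List[float]]  # rows = time steps, columns = embedding channels


def temporal_conv(x: Matrix, kernels: List[Matrix], biases: List[float]) -> Matrix:
    """Slide each kernel along the time axis; a kernel covers k steps and all channels."""
    k = len(kernels[0])
    out = []
    for t in range(len(x) - k + 1):
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias
            for i in range(k):
                for c, weight in enumerate(kernel[i]):
                    total += weight * x[t + i][c]
            row.append(total)
        out.append(row)           # one output channel per kernel
    return out


def temporal_max_pool(x: Matrix, size: int) -> Matrix:
    """Take the per-channel maximum over non-overlapping windows of `size` time steps."""
    return [
        [max(column) for column in zip(*x[t:t + size])]
        for t in range(0, len(x) - size + 1, size)
    ]
```

With an input of 4 time steps and 2 channels, a single kernel of width 2 produces 3 output steps with 1 channel, and pooling with `size=2` halves the number of time steps while keeping the channel count.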
Fig. 2. Architecture of spam classifier
For the Message-ID we designed a subnet with four temporal convolution layers, each with a fixed number of filters. We applied ReLU as the activation function. The first layer has the largest filter size in order to extract information from longer subsequences. After the first and last layers, we inserted a temporal max-pooling layer to ensure stability of training. In the HeaderSeq branch, we used two layers: a temporal convolution layer and a temporal max-pooling layer. The shallow architecture is the result of the short length of HeaderSeq. The outputs of the convolutional nets are concatenated with the encoded X-Mailer into a one-dimensional tensor, as illustrated in Figure 2. Finally, we added two fully connected layers and inserted a dropout layer [7] between them for regularization. We used a sigmoid activation on the model's output to obtain the probability of spam.
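The final stage described above (concatenation, two fully connected layers, sigmoid output) can be sketched in pure Python as an inference-time forward pass. The function names, the ReLU activation on the hidden layer and all weight shapes are assumptions for illustration; dropout is omitted because it is only active during training.

```python
import math
from typing import List

Vector = List[float]


def dense(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def spam_probability(
    msg_id_features: Vector,    # flattened output of the Message-ID branch
    header_features: Vector,    # flattened output of the HeaderSeq branch
    x_mailer_onehot: Vector,    # one-hot encoded X-Mailer
    w1: List[Vector], b1: Vector,
    w2: List[Vector], b2: Vector,
) -> float:
    features = msg_id_features + header_features + x_mailer_onehot  # concatenation
    hidden = relu(dense(features, w1, b1))  # hidden activation is an assumption
    # Dropout would sit here during training; at inference it is the identity.
    logit = dense(hidden, w2, b2)[0]
    return sigmoid(logit)
```

The returned value lies strictly between 0 and 1 and is compared with a probability threshold to make the quarantine decision.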
Many research papers state that a CNN usually requires large-scale datasets to work and achieve competitive performance in different areas. Unfortunately, public datasets for spam classification are fairly small and do not show actual threats because they are not regularly updated.

In this work, we used a collection that consists of metadata from tens of millions of real-time e-mail scans. We split the data into training and test datasets by timestamp to avoid leaking information from the future into the past. We sampled 120 million objects for training and 40 million objects for testing. In both datasets, the proportion of spam is about 40 percent. We optimized the weights of the model using SGD with momentum of 0. [...]

Fig. 3. Precision-recall curve computed on the test dataset
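The timestamp-based split used for the training and test datasets can be sketched as follows. The record layout (a dict with a `timestamp` key) and the function name `split_by_time` are assumptions; the point is only that every training record precedes every test record, so nothing from the future leaks into training.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def split_by_time(records: List[Record], cutoff: float) -> Tuple[List[Record], List[Record]]:
    """Records strictly before `cutoff` go to training, the rest to testing."""
    train = [r for r in records if r["timestamp"] < cutoff]
    test = [r for r in records if r["timestamp"] >= cutoff]
    return train, test
```

A random shuffle split, by contrast, would mix messages from the same spam campaign into both sets and could make the test results look better than they would be on future traffic.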
We show the PR curve in Figure 3 to demonstrate the model performance on the test data. As mentioned earlier, a classifier should have high precision to deliver legitimate e-mail messages without delay. We defined a probability threshold for which the precision is equal to 0.998 and the recall is 0. [...]

This article proposes a non-content-based classification approach to delay potential spam messages in real time. On the one hand, we demonstrated a novel feature set and a way to handle it for the spam classification task. On the other hand, we showed that this method is well suited for enterprise solutions because it has a simple update scheme, high performance and a low false positive rate. Furthermore, by combining this technology with resource-intensive checks that require additional time for verification or response (such as Whois requests, in-depth content verification, etc.), we can get a fast and cost-effective system for detecting spam messages.
References
1. Bhowmick, A., Hazarika, S. M.: Machine Learning for E-mail Spam Filtering: Review Techniques and Trends. arXiv preprint arXiv:1606.01042 (2016)
2. Biggio, B., Corona, I., Fumera, G., Giacinto, G., Roli, F.: Bagging Classifiers for Fighting Poisoning Attacks in Adversarial Classification Tasks. In: MCS 2011, LNCS, vol. 6713, pp. 350-359, Springer-Verlag (2011)
3. Boureau, Y.-L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 2559-2566, IEEE, Finland (2010)
4. Dhamani, N., Azunre, P., Gleason, J. L., Corcoran, C., Honke, G., Kramer, S., Morgan J.: Using Deep Networks and Transfer Learning to Address Disinformation. arXiv preprint arXiv:1905.10412 (2019)
5. Google Patents: Delay technique in e-mail filtering system, https://patents.google.com/patent/US20090157708A1/en. Last accessed 15 Sep 2019
6. Halgas, L., Agrafiotis, I., Nurse, J.: Catching the Phish: Detecting Phishing Attacks using Recurrent Neural Networks (RNNs). In: 20th World Conference on Information Security Applications, Korea (2019)
7. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735-1780 (1997)
9. Hu, Y., Guo, C., Ngai, E., Liu, M., Chen, S.: A Scalable Intelligent Non-content-based Spam-filtering Framework. Expert Systems with Applications 37(12), 8557-8565 (2010)
10. Klimt, B., Yang, Y.: The Enron corpus: A new dataset for email classification research. In: Proceedings of the 15th European Conference on Machine Learning, pp. 217-226, Pisa, Italy (2004)
11. Qaroush, A., Khater, I. M., Washaha, M.: Identifying spam e-mail based-on statistical header features and sender behavior. In: CUBE International Information Technology Conference, pp. 771-778, ACM, USA (2012)
12. RFC 1521: Mechanisms for Specifying and Describing the Format of Internet Message Bodies, https://tools.ietf.org/html/rfc1521. Last accessed 15 Sep 2019
13. Schmidhuber, J.: Deep Learning in Neural Networks: An Overview. Neural Networks 61, 85-117 (2015)
14. Securelist: Spam and phishing in Q2 2019, https://securelist.com/spam-and-phishing-in-q2-2019/92379/