Improving DGA-Based Malicious Domain Classifiers for Malware Defense with Adversarial Machine Learning
Ibrahim Yilmaz, Ambareen Siraj, Denis Ulybyshev
Department of Computer Science, Tennessee Technological University
Cookeville, USA
yilmaz42, asiraj, [email protected]
Abstract—Domain Generation Algorithms (DGAs) are used by adversaries to establish Command and Control (C&C) server communications during cyber attacks. Blacklists of known/identified C&C domains are often used as one of the defense mechanisms. However, since blacklists are static and generated by signature-based approaches, they can neither keep up with nor detect never-seen-before malicious domain names. Due to this shortcoming of blacklist domain checking, machine learning algorithms have been used to address the problem to some extent. However, when training is performed with limited datasets, the algorithms are likely to fail in detecting new DGA variants. To mitigate this weakness, we successfully applied a DGA-based malicious domain classifier using the Long Short-Term Memory (LSTM) method with a novel feature engineering technique. Our model's performance shows a higher level of accuracy compared to a previously reported model from prior research. Additionally, we propose a new method using adversarial machine learning to generate never-before-seen malware-related domain families that can be used to illustrate the shortcomings of machine learning algorithms in this regard. Next, we augment the training dataset with these new samples so that training of the machine learning models becomes more effective in detecting never-before-seen malicious domain name variants. Finally, to protect blacklists of malicious domain names from disclosure and tampering, we devise secure data containers that store blacklists and guarantee their protection against adversarial access and modifications.
Index Terms—Domain Generation Algorithms, Adversarial Machine Learning, Long Short-Term Memory, Data Privacy
I. INTRODUCTION
Security researchers have developed many different defense mechanisms in order to protect computer systems against malware and malicious botnet C&C communications. Blacklists of malicious websites are one of the most commonly used defense mechanisms, where lists of domain names or IP addresses that are flagged as harmful are maintained. Any messages to/from these listed sites are feared to host potential C&C servers and hence blocked to prevent any further communications. As a counterattack, attackers developed DGAs as a measure to thwart blacklist detection [1].

In recent years, hacker communities have been utilizing DGAs as the primary mechanism to produce millions of malicious domain names automatically through pseudo-random domain names in a very short time period [2]. Subsets of these malicious domain names are utilized to map to the C&C servers. These dynamically created domain names successfully evade static blacklist-checking mechanisms. Additionally, as one domain gets recognized and blocked, the C&C server can easily switch to another one.

To overcome the limitations of static domain blacklists, machine learning (ML) techniques have been developed to detect malicious domain names, and these techniques have yielded mostly promising results [3], [4], [5]. However, ML models do not perform well with never-seen-before DGA families when an unrepresentative or imbalanced training dataset is used. To address this problem, we propose a novel approach to generate a rich set of training data representing malicious domain names using a data augmentation technique.

Data augmentation of an existing training dataset is one way to make ML models learn better and, as a result, perform more robustly. Nevertheless, classic data augmentation merely creates a restricted set of reasonable alternatives. In our approach, as illustrated in Figure 1, an adversarial machine learning technique is used to generate a diverse set of augmented data by means of data perturbation. The generated adversarial domain names are extremely difficult to differentiate from benign domain names. As a result, the machine learning classifier misclassifies malicious domains as benign ones. Afterwards, these adversarial examples are correctly re-labeled as malicious and reintroduced into the existing training dataset (see Figure 1). In this way, we augment the blacklist with diverse data to effectively train the machine learning models and increase the robustness of the DGA classifiers.

In addition, we devise a secure container to store and transfer the blacklists of malicious domain names in encrypted form as a Protected Spreadsheet Container with Data (PROSPECD), presented in [6]. PROSPECD provides confidentiality and integrity of the blacklists so that they can be used as training data to build a secure model. In addition to data integrity, PROSPECD provides origin integrity. This container protects the adversarial samples, used to teach the model, from unknown adversarial perturbations. The protected blacklist can be marketed commercially [7] to internet service providers and companies who need to maintain their own internal blacklists.

Fig. 1: Overview of the Proposed Methodology.

A. Our Contributions
Blacklisting is a security strategy that keeps network flows and computer environments secure [8]. Typical network traffic blacklists include malicious IP addresses or domain names, which are blocked from communication attempts in both directions. However, the coverage of blacklists is insufficient and unreliable because adversarial hacker communities can compromise a system by generating malware domains dynamically using DGAs that easily bypass the static blacklist. Kuhrer et al. [9] evaluated 15 public malware blacklists as well as 4 blacklists served by antivirus vendors. Their findings show that blacklists fail to protect systems against prevalent malware sites generated by DGAs.

In order to address this shortcoming, researchers have mostly proposed solutions based on reverse engineering techniques to identify and block bot malware [10]. However, such solutions are not always feasible due to the obfuscation of the underlying algorithms, as hackers adapt their algorithms swiftly to exploit the vulnerabilities in the system. Other alternative solutions require auxiliary contextual information. One of these alternative solutions focuses on network traffic analysis [11], [12] or broad network packet examination [13], [14]. However, these techniques may not be able to keep up with large-scale network traffic. Therefore, there is a need for a sophisticated network traffic analysis tool for effective blacklisting.

In response to the above issues, detection of malicious domains has increasingly evolved towards the usage of machine learning techniques. The performance of solutions proposed to automatically detect malicious domains mostly suffers on never-seen-before malicious domains. This is due to the lack of generalization when a model is not trained effectively with a representative or balanced training dataset. For this reason, blacklists must be constantly updated in order to identify and prevent DGA-generated malicious domain connections.

Data augmentation is an approach where more data is created from existing data in a way that can enhance the usefulness of the application. Anderson et al. [15] demonstrated how data can be augmented more effectively by using an adversarial machine learning technique. The researchers generated adversarial domain names using the Generative Adversarial Network (GAN) methodology. In their approach, two neural networks are trained simultaneously, and the classifier is later trained with a dataset which includes adversarial samples to harden the DGA classifier. However, the main drawback of this approach is the unpredictability of desirable results because of the difficulty of controlling both classifiers at the same time, even when a good optimization algorithm is used. As a result, it fails to always converge to a point of equilibrium to generate new domain names. Additionally, controlling the diversity of produced samples is challenging with GAN models [16], [17]. In such cases, the newly generated data do not add to the diversity of the current data. Hence, this solution alone cannot increase the malicious detection capabilities of blacklists against never-before-seen DGA families.

To improve the accuracy of such detection mechanisms, we propose a new technique based on data perturbation without relying on a fresh public blacklist or an external reputation database. In our approach, we observe how the model works and use that knowledge to mislead the DGA classifier.
To do this, noise, carefully calculated from that observation, is added to the DGA-based malicious domains so that they appear non-malicious. These adversarial samples are then predicted as benign by the machine learning (ML) model. Such an adversarial attack can be addressed with adversarial training [18]. Therefore, after correctly labeling these seemingly benign adversarial samples, the ML model is trained with the augmented dataset. The experimental results demonstrate that the retrained ML model is able to detect never-before-seen DGA malware better than other similar approaches.

Our work has the following contributions:
• Using a machine learning technique based on the Long Short-Term Memory (LSTM) model for automatic detection of malicious domains, with a DGA classifier that analyzes a massive labeled DGA dataset character by character.
• To the best of our knowledge, this is the first study to propose the generation of malicious domain names using a data perturbation approach in order to expand the training dataset with adversarial samples.
• Demonstrating that, as expected, the LSTM model fails to recognize newly introduced adversarial samples in the augmented training dataset.
• Applying adversarial training to train the model with correct labelling of the adversarial samples in the training dataset to increase the model's generalization ability.
• Demonstrating that the augmented training dataset can help the LSTM model to detect not only never-seen-before DGAs, but also novel DGA families.

The rest of this paper is organized as follows: the literature review in the context of our work is discussed in Section II. The necessary background for DGA-based malicious domain models is reviewed in Section III. The core design of our system, including the adversarial machine learning models to generate malicious domain names and the data containers to store them, is presented in Section IV. We discuss the implementation of the adversarial machine learning models in Section V. The evaluation results of our study are presented in Section VI. Section VII concludes the paper.

II. RELATED WORK
Domain Generation Algorithms and the detection of malicious domain names have been analyzed by different researchers for a number of years. Daniel et al. [19] presented a taxonomy of DGA types by analyzing the characteristics of 43 different DGA-based malware families and compared the properties of these families. They also implemented previous studies with 18 million DGA domains that were created to identify malicious domains. They reported further progress in DGA detection.

Detection of DGA botnets became feasible with the implementation of powerful machine learning models. Lison et al. [20] implemented a recurrent neural network model for the detection of DGAs. Their empirical study detected malicious domain names with high accuracy. Justin et al. [21] defined several models to detect malicious web sites, including a logistic regression, a support vector machine, and a Bayesian model. They used the DMOZ dataset for benign websites, while PhishTank and Spamscatter were used for malicious websites. Bin et al. [22] addressed the same issue using different machine learning classifiers for the detection of DGAs. They created a convolutional neural network (CNN) and a recurrent neural network (RNN) model for the classification of malicious and benign domain names. They compared the results of both models in terms of their performance and reported that both models performed comparably. Duc et al. [23] dealt with the multiclass imbalance problem of LSTM algorithms for the detection of malicious domain names generated by DGAs. The authors claimed that LSTM algorithms performed poorly with imbalanced datasets. To tackle this imbalanced dataset problem, they proposed a Long Short-Term Memory Multiclass Imbalance (LSTM.MI) algorithm and showed that their proposed algorithm provided more accurate results through different case studies.

In addition, Mayana et al. [24] introduced a WordGraph method to recognize dictionary-based malicious domains. The authors asserted that more sophisticated DGAs are able to avoid detection by conventional machine learning classifiers. They carried out their experiments by extracting dictionary information without using reverse engineering. Bin et al. [25] defined a deep neural network model as an inline DGA detector. They caution that most of the available datasets are not good representations of malicious domains or are outdated. Hence, machine learning models perform poorly when trained using such datasets. Furthermore, they explained that reverse engineering was a difficult method for training models. To tackle these problems, the researchers offered a novel detector for malicious domains without the need for reverse engineering. Their proposed technique was based on real traffic and reported to detect malicious domains in real time. Woodbridge et al. [3] created a machine learning classifier based on the LSTM network to detect malicious domain names in real time. Their classifier detected multiclass domains by categorizing them into particular malware families. The model predicted a domain name as malicious or benign based on the domain name alone, without any additional information.

However, although these studies achieved high detection rates for particular DGA families, the performance of machine learning based detection systems is poor on new DGA variants when the models are trained with unrepresentative or imbalanced training datasets. To handle this issue, Anderson et al. [15] offered a GAN algorithm to generate new domain names.
In their GAN approach, they implemented two different deep neural network classifiers named the discriminator and the generator. According to this GAN methodology, new malicious domain names are generated by the generator, which evades the discriminator's detection. Their case studies demonstrated that the new malicious domain names also bypass a random forest classifier. Once the model was trained with adversarial samples, it was hardened against new DGA families. However, the authors did not test it on DGA families created using a dictionary. Additionally, implementation of this approach is challenging due to the need to control two machine learning models, which might be unsuitable for detecting malware-related domain names. Unlike this approach, we propose to augment data by using an efficient data perturbation technique that generates hard-to-detect DGA families and identifies DGA types that are created either randomly or using a dictionary.

To overcome limitations of machine learning models in the aforementioned circumstances, Curtin et al. [26] proposed to combine a neural network model with supplementary domain registration information. This additional information, known as WHOIS data, helped the neural network model to efficiently identify the most difficult samples that were generated using English words. However, cybercriminals take advantage of bulk registration services by registering thousands of domain names in a short time, several months before the start of nefarious activities [27]. In addition, unauthorized people can access this information and falsify it by impersonating legitimate users. This makes the information questionable. Compared to this, our approach efficiently detects DGA families solely based on the domain names, without relying on any supplementary information.

The PROSPECD container, used to store and transfer blacklisted malicious domain names, is presented in [6]. Compared to the privacy-preserving data dissemination concept proposed by Lilien and Bhargava [28], it has the following features:
• Detection of several types of data leakages that can be made by authorized entities to unauthorized ones;
• Enforcement of access control policies either on a central server or locally on a client's side, in a Microsoft Excel® Add-in or in a cross-platform application [6];
• Container implementation as a digitally signed, watermarked, Microsoft Excel®-compatible spreadsheet file with hidden and encrypted data and access control policy worksheets;
• An on-the-fly key derivation mechanism for data worksheets.

(This paper is an independent publication and is neither affiliated with, nor authorized, sponsored, or approved by, Microsoft Corporation [29].)

The primary difference between PROSPECD and an Active Bundle [30], [31], [32] is that PROSPECD does not store an embedded policy enforcement engine (Virtual Machine). In contrast with a solution to encrypt the desired cells in a spreadsheet file, proposed by Tun and Mya in [33], in PROSPECD all the data worksheets are encrypted with separate keys, generated on-the-fly. PROSPECD supports role-based and attribute-based access control. Furthermore, digital and visual watermarks are embedded in PROSPECD to enable detection of data leakages. A Secure Data Container, proposed in [34] to store device state information, only supports centralized policy enforcement. PROSPECD supports both centralized and local policy enforcement mechanisms [6].

III. BACKGROUND
In this section, we review background information related to our research.

Domain Generation Algorithm (DGA): Domain generation algorithms are a primary means of connecting various families of malware with new or never-before-seen domains to avoid detection. There are many such DGA-based malware families (malware that connect to DGA-generated domain names). According to a study, the five most known families are Conficker, Murofet, BankPatch, Bonnana, and Bobax [35]. Although many DGA-based domain names are produced randomly, some are generated using a dictionary. The detection of these types of domain names is more difficult because of their similarity to legitimate domains.

Gradient Descent Algorithm: The gradient descent algorithm is the most popular optimization method used by machine learning classifiers to minimize errors. It takes into account the first derivative when modifying all parameters under consideration [36]. Gradient descent always strives to find the most appropriate way to minimize errors. The learning process starts with randomly produced weight values. Most of the time, these values are set to an initial value of zero and are used to calculate the loss function value. The gradient descent algorithm is then used to find a way to reduce the loss function. All weights are updated through the backpropagation process based on the gradient descent algorithm. We generate new adversarial domain names in our data augmentation method by utilizing and modifying the gradient descent algorithm's behavior. In Section V, how such adversarial samples are created is discussed in detail.

Long Short-Term Memory (LSTM) Model: The LSTM model, a specialized Recurrent Neural Network (RNN), is used in our approach for automatic detection of malicious domains. RNNs are supervised machine learning models that are commonly used to handle the processing of sequential data [37]. An RNN takes the previous and current inputs into account, while traditional networks consider all inputs independently. In our study, we implement the model on a character-by-character basis, so that it captures the importance of the order of the characters' occurrence in the word. Essentially, the model learns the occurrences of the characters in a sequential way. For example, for the domain name google, without the top-level domain, the model first learns the character g, then o, predicting that it succeeds g. Traditional neural networks, however, do not take the position of the characters into account. In addition, traditional neural networks process fixed-size input and produce fixed-size output, whereas an RNN does not have such limitations.

When input data such as domain names contain long sequences, a traditional RNN struggles to learn such long data dependencies, which is known as the vanishing gradient problem [38]. In order to avoid this problem, LSTM, a special kind of RNN, was introduced in [37]. LSTM relies on gating mechanisms, where information can be read, written, or erased via a set of programmable gates. This allows recurrent nets to keep track of information over many time steps and gain the ability to preserve long-term dependencies. For example, for a malicious domain name that is 32 characters long, LSTM keeps track of the relevant information about these characters throughout the process. Furthermore, a study has shown that LSTMs outperform previous RNNs for the solution of both context-free language (CFL) and regular language problems [39]. Researchers have reported that LSTMs generalize better and faster, leading to the creation of more effective models. In our study, the LSTM model learns how to detect DGA-based malicious domains in a similar way to what is mentioned above.
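As an illustration of the character-by-character processing described above, the following minimal PyTorch sketch builds a small character-level LSTM classifier; the class name, layer sizes, and hidden dimension are our own illustrative assumptions rather than the exact architecture detailed in Section V.

    import torch
    import torch.nn as nn

    # Minimal character-level LSTM classifier sketch (illustrative only).
    # Sizes follow the paper's description: 256 possible ASCII characters
    # and 256-dimensional character embeddings, with two LSTM layers.
    class CharLSTMClassifier(nn.Module):
        def __init__(self, vocab_size=256, embed_dim=256, hidden_dim=128, num_layers=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                                batch_first=True, dropout=0.5)
            self.fc = nn.Linear(hidden_dim, 1)  # benign vs. DGA (binary)

        def forward(self, char_ids):
            # char_ids: (batch, sequence_length) of character codes, consumed
            # one character position at a time by the LSTM.
            embedded = self.embedding(char_ids)        # (batch, seq, embed_dim)
            _, (hidden, _) = self.lstm(embedded)       # last hidden state summarizes the name
            return torch.sigmoid(self.fc(hidden[-1]))  # probability that the domain is malicious

    # Example: encode the domain "google" (without the TLD) as ASCII codes.
    domain = "google"
    char_ids = torch.tensor([[ord(c) for c in domain]])
    score = CharLSTMClassifier()(char_ids)  # untrained model, so the score is meaningless here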
IV. CORE DESIGN

A. Generating New Malicious Domain Names
Machine learning models' performance substantially relies on the training dataset, which is crucial for building effective classifiers. However, one of the biggest challenges with any ML model is accumulating a training dataset that is representative and balanced enough to enable the creation of an effective machine learning model. This process might be costly, time-consuming, or both. A restrictive training dataset can lead to poor performance of the ML model, and that is the primary reason DGA classifiers do not work well for automated malicious domain name detection. With traditional blacklists used in training, ML classifiers cannot detect never-before-seen DGA families. The model needs to be readjusted constantly with new variations of training data for effective threat detection. To address this issue, we propose to create a blacklist of domain names generated using a novel adversarial machine learning technique.

Our adversarial approach is based on data perturbation techniques inspired by [18], where domain names are perturbed along the gradient of a targeted DGA classifier with respect to the classifier loss. Moving the input along the gradient so that the loss function is maximized instead of minimized can mislead malware detection classifiers. Even though these domains are malicious, the DGA classifier, based on the LSTM model, predicts them as benign. As a result, new adversarial domains are generated that appear benign and do not match the blacklist data. Our method is formalized below [18].

Let x be a given malicious domain name, y its label (malicious), and M a DGA classifier such that M(x): x → y. Let x̂ be the domain name crafted using our adversarial attack and ŷ the class label such that M(x̂): x̂ → ŷ. The objective is

\max \; l(M, \hat{x}, y) \quad (1)
y \neq \hat{y} \quad (2)
\text{subject to } \hat{x} = x + \delta_x \quad (3)

Here, l(M, x, y) is the loss function of the DGA classifier in (1). A newly created adversarial domain name is predicted as benign by the DGA classifier in (2). In (3), \delta_x represents the perturbation added to the vector form of the given domain name. \delta_x is calculated as follows [18]:

\delta_x = \epsilon \cdot \mathrm{sign}(\nabla_x \, l(M, x, y)) \quad (4)

Here, \mathrm{sign}(\nabla_x \, l(M, x, y)) represents the direction in which the loss function of the DGA classifier increases, and \epsilon controls the magnitude of the noise in (4). A smaller epsilon value perturbs the original feature vector slightly, while a larger one perturbs it significantly and misleads the DGA classifier to a greater extent. On the other hand, a larger perturbation can be more easily detected than a smaller one by human eyes.

The calculated noise is added to each character embedding of the input data. The resulting noisy embeddings are compared to every character's embedding using cosine similarity [40] to measure their distance, and the final character is chosen based on this operation. The design of the model is demonstrated in Figure 2. In the example in the figure, the character g turns into c through our use of adversarial learning. Our adversarial domain name generation algorithm is summarized in Algorithm 1.

DGA detectors can be seen as black-box devices in real-world settings, since, in the black-box scenario, an adversary does not have any knowledge about the inner workings of the target model.
Nevertheless, for the sake of simplicity, we implement our proposed technique under the white-box assumption, where we obtain the optimum perturbation by accessing the target model so that we can compute gradients. Although the black-box assumption can be perceived as more realistic for this work, it is important to keep in mind that previous studies showed that adversarial samples have the transferability property [41]. This means that an adversarial example generated for one DGA model is likely to be misclassified by another DGA detector as well, since ML models trained with similar datasets from the same source learn similar decision boundaries. We leave testing adversarial examples under black-box settings for future work.
Fig. 2: Generation of Adversarial Samples Using Character-Level Transformations.
ALGORITHM 1:
Pseudocode of our proposed adversarial domain name generation approach.

Input: {X_i, Y_i}, where X_i is each domain name and Y_i is the corresponding ground-truth label;
X = (x_1 || x_2 || ... || x_n), where x_i is each character of a given input domain name;
training iteration number N_itr, number of adversarial examples N_adv, number of training samples N_train, number of characters of a given input domain name N_total.

Function generate_domain_names({X_j, Y_j}):
  for iteration = 0, ..., N_itr do
    train the DGA classifier M on the N_train training samples
  end
  for iteration = 0, ..., N_adv do
    for i = 0, ..., N_total do
      δ_{x_i} = ε × sign(∇_{x_i} l(M, x_i, y_i))
      z_i = x_i + δ_{x_i}
      x̂_i = character whose embedding x maximizes the cosine similarity (z_i · x) / (‖z_i‖ ‖x‖)
    end
    Output: X̂ = (x̂_1 || x̂_2 || ... || x̂_n)
  end
end
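The following is a minimal, illustrative PyTorch sketch of the perturbation step in Algorithm 1, reusing the hypothetical CharLSTMClassifier sketched in Section III; the function name, tensor shapes, and the use of binary cross-entropy with logits are assumptions for illustration, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def perturb_domain(model, char_ids, label, epsilon):
        # Illustrative sketch: add epsilon * sign(gradient) to each character
        # embedding, then snap back to the closest real character by cosine
        # similarity (names and shapes are our own assumptions).
        embedding_matrix = model.embedding.weight            # (vocab, embed_dim)
        embedded = model.embedding(char_ids).detach()
        embedded.requires_grad_(True)

        # Forward pass on the embedded characters and loss w.r.t. the true label.
        _, (hidden, _) = model.lstm(embedded)
        logit = model.fc(hidden[-1])
        loss = F.binary_cross_entropy_with_logits(logit, label)
        loss.backward()

        # FGSM-style perturbation in embedding space, as in Equation (4).
        noisy = embedded + epsilon * embedded.grad.sign()

        # Map each noisy embedding to the most similar character embedding.
        sims = F.cosine_similarity(noisy.squeeze(0).unsqueeze(1),   # (seq, 1, dim)
                                   embedding_matrix.unsqueeze(0),   # (1, vocab, dim)
                                   dim=-1)                          # (seq, vocab)
        return sims.argmax(dim=-1)                                  # adversarial character codes

The ε parameter mirrors Equation (4): it controls how far each character embedding moves before it is snapped back to the nearest real character by cosine similarity.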
B. Protected Spreadsheet Container with Data (PROSPECD) for Domain Name Blacklists

We propose to use a PROSPECD data container, presented in [6], to securely store and transfer blacklisted malicious domain names. In our use case, PROSPECD, implemented as an encrypted and digitally signed spreadsheet file, contains the following watermarked data worksheets:
• "Domain Blacklist" to store encrypted malicious domain names, detected by our classifier;
• "Metadata" to store encrypted metadata, which include access control policies;
• "General Info" to store encrypted information about the classifier used to detect the malicious domain names and its execution details.

PROSPECD provides data confidentiality and integrity, origin integrity, role-based and attribute-based access control, and centralized and decentralized enforcement of access control policies. Digital and visual watermarks, embedded into a PROSPECD spreadsheet file, enable detection of several types of data leakages that can be made behind-the-scenes by authorized parties to unauthorized ones [6].

PROSPECD Generator.
The malicious domain names classifier runs on a trusted server. Once the blacklist of domain names is generated, a dedicated process writes it, as well as the relevant information, to a spreadsheet file. Then the PROSPECD generator, currently implemented as a command-line utility, is called. It takes as input a spreadsheet file with the "Domain Blacklist" worksheet and two other worksheets ("General Info" and "Metadata") in plaintext form, and generates a separate spreadsheet file with encrypted worksheets. Each worksheet is encrypted with a separate symmetric 256-bit AES key, generated on-the-fly [6].
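For illustration only, the sketch below shows one way to derive a per-worksheet 256-bit AES key on the fly and encrypt a worksheet body in Python; the function name, the use of HKDF, and AES-GCM are our assumptions, and PROSPECD's actual container format and key-derivation scheme are described in [6].

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.hazmat.primitives.kdf.hkdf import HKDF
    from cryptography.hazmat.primitives import hashes

    def encrypt_worksheet(master_secret, worksheet_name, worksheet_bytes):
        # Illustrative only: derive a per-worksheet 256-bit key bound to the
        # worksheet name, then encrypt the worksheet contents with AES-GCM.
        key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                   info=worksheet_name.encode()).derive(master_secret)
        nonce = os.urandom(12)
        ciphertext = AESGCM(key).encrypt(nonce, worksheet_bytes, None)
        return nonce, ciphertext

    # Example: seal the "Domain Blacklist" worksheet body (placeholder domains).
    master_secret = os.urandom(32)
    blacklist = "\n".join(["exampledga1.com", "exampledga2.net"]).encode()
    nonce, sealed = encrypt_worksheet(master_secret, "Domain Blacklist", blacklist)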
PROSPECD Data Access on a Trusted Server.
The PROSPECD container, stored on a trusted server, can be accessed remotely from a web viewer. The client opens the Authentication Server (AS)'s URL in a web browser, selects the data subset to retrieve ("Domain Blacklist", "General Info" or "All") and enters their credentials: username (role) and password. The accessible worksheets from PROSPECD are decrypted, using the on-the-fly AES key derivation scheme, based on the client's role and attributes. These attributes include the versions of the web browser and operating system, as well as the type of device the client uses. Decrypted worksheets are sent to the client as a JSON object over an HTTPS communication channel [6]. PROSPECD supports a RESTful API. Table I shows the access control policies. The role "User" can only access blacklisted domain names from the "Domain Blacklist" worksheet. The role "Administrator" is allowed to access all the worksheets and also to download the PROSPECD file from the server to their local device, to access the data locally or transfer it to other parties.
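A minimal sketch of the Table I policy evaluation, assuming a simple role-to-worksheet mapping (this is not PROSPECD's actual policy engine):

    # Decide which worksheets a client may have decrypted, based on their role.
    ACCESS_POLICIES = {
        "Administrator": {"Domain Blacklist", "General Info", "Metadata"},
        "User": {"Domain Blacklist"},
    }

    def accessible_worksheets(role, requested):
        allowed = ACCESS_POLICIES.get(role, set())
        return requested & allowed

    # A "User" asking for everything is granted only the blacklist worksheet.
    print(accessible_worksheets("User", {"Domain Blacklist", "General Info", "Metadata"}))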
Local PROSPECD Data Access.
Authorized parties can access PROSPECD data locally, either from the Microsoft Excel® Add-in or from the standalone cross-platform application.

TABLE I: PROSPECD Access Control Policies
Role          | Domain Blacklist | General Info | Metadata
Administrator | YES              | YES          | YES
User          | YES              | NO           | NO
For the first option, a user needs to "download and install the Microsoft® Excel Add-in" [6], written in C#, open a PROSPECD container, and enter valid credentials. Then PROSPECD's digital signature is verified. If it is valid, the "Metadata" worksheet is decrypted, and the access control policies stored in this worksheet, shown in Table I, are evaluated. Then the decryption keys for the accessible data worksheets are derived [6]. To prevent unauthorized data disclosures, the authenticated user is not allowed to print or save the opened spreadsheet file once the Add-in has been launched. When the user closes the application, all the data are encrypted back to their original values and the visibility of all the worksheets is reset back to VeryHidden, after which the application closes. In addition to the Microsoft® Excel Add-in, a cross-platform application was developed to view PROSPECD data [6]. This application provides a graphical user interface and does not allow the user to store decrypted PROSPECD files locally, to prevent possible data leakages.
V. EXPERIMENTAL METHODOLOGY
This section describes the dataset used to build a DGA classifier based on an LSTM model, along with an explanation of the model implementation.
A. Dataset
The experimental dataset includes one non-DGA class (benign domain names) and 68 DGA families (malicious domain names). Data is collected from two different, publicly available sources [42], [19].

For benign domains, we use the Majestic top 1 million dataset [42]. This dataset includes the top one million website domain names worldwide and is updated daily. For malicious domains, we obtain data from DGArchive, a web repository of DGA-based malware families [19]. This repository has over 18 million DGA domains. We have worked with 68 DGA malware families, some generated by traditional DGAs and the remainder produced by dictionary DGAs. We used both traditional and dictionary-based DGAs, with over a hundred thousand malicious domains in total. To ensure a fair comparison, we used a subset of 135K samples from the Majestic top 1 million dataset so that the classifier is not biased towards a majority class, thus preventing overfitting.
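A minimal sketch of how such a labeled, balanced dataset could be assembled is shown below; the file names are hypothetical placeholders for the Majestic Million [42] and DGArchive [19] exports.

    import random

    def load_domains(path):
        with open(path) as f:
            return [line.strip().lower() for line in f if line.strip()]

    benign = load_domains("majestic_million.txt")        # placeholder file name
    malicious = load_domains("dgarchive_domains.txt")    # placeholder file name

    # Subsample 135K benign domains so the two classes stay balanced.
    random.seed(42)
    benign = random.sample(benign, 135_000)

    dataset = [(d, 0) for d in benign] + [(d, 1) for d in malicious]
    random.shuffle(dataset)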
B. ML Model Implementation
We implement our LSTM model in Python using PyTorch [43]. We use the LSTM at the character level with character embeddings (vector forms of the characters). This means that every character is encoded as a representative vector. We convert from the word space to a vector space to extract better features for the machine learning classifier. Each ASCII character (256 in total) is represented by a vector whose size is set to 256. In this way, we create a 256-by-256 embedding matrix where each row represents a character and character embeddings are represented by column vectors of the embedding matrix. Once the perturbation technique is employed, the embedded vectors of the characters of malicious domains are transformed back into word space.

We divided the dataset into training and test data. We use 90% of the dataset for training and the remainder is reserved for testing. The training set is used by the model to learn detection of DGA-based malicious domains. In our implementation, only the domain names are considered by the model, and the characters are pulled from the domain names character by character. At each time interval, one character's corresponding vector is fed into the LSTM model. The character embeddings are randomly initialized at the beginning. The model is able to learn the dependencies the characters have with each other and the conditional probabilities of the aforementioned characters. Thus, each character's embedding is learned by the LSTM model itself and the matrix is filled with these embeddings. In the test phase, the unseen data is predicted by the model as malicious or benign. The model's performance is analyzed in Section VI in detail.

In this work, our main goal is to augment the training dataset to increase the model's resiliency and improve its performance in detecting never-before-seen or yet-to-be-observed DGA families. To do this, an optimally calculated noise is added to each character embedding of the input data by the data perturbation technique. The newly created embedding with the added noise may not correspond to any character. Therefore, the model looks for the closest character embedding and assigns that character to the corresponding embedding. Here we use approximate similarity search [44] by applying cosine similarity [40], which takes dot products between the newly created embedding and each row of the embedding matrix to calculate the similarity. The new character is assigned from the row yielding the maximum similarity value.

Technically speaking, the LSTM model consists of two hidden layers along with the input and output layers. Dropout, a known regularization technique, is used with a rate of 0.5 in order to avoid overfitting. The fine-tuned parameters are found by using a batch size of 128 and a learning rate of 0.001 along with an epoch number of 6. With learning rates lower than this, more iterations may be needed. In addition, we use the Adam optimization algorithm, an extension to stochastic gradient descent, to minimize the error by adjusting network hyperparameters in an iterative way. Furthermore, binary cross-entropy, used for binary classification, is the loss function that measures the cost over the probability distribution of malicious and benign domain names.
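The following sketch shows a training loop with the hyperparameters listed above (batch size 128, learning rate 0.001, 6 epochs, Adam, binary cross-entropy); it assumes a model such as the illustrative CharLSTMClassifier from Section III and is not the authors' exact code.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train(model, inputs, labels, epochs=6, batch_size=128, lr=0.001):
        # inputs: LongTensor (N, seq_len) of padded character codes,
        # labels: FloatTensor (N,) of 0.0 (benign) / 1.0 (malicious).
        loader = DataLoader(TensorDataset(inputs, labels),
                            batch_size=batch_size, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.BCELoss()  # model outputs sigmoid probabilities
        for _ in range(epochs):
            for batch_x, batch_y in loader:
                optimizer.zero_grad()
                loss = criterion(model(batch_x).squeeze(1), batch_y)
                loss.backward()
                optimizer.step()
        return model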
VI. EVALUATION
The results of our experiments are divided into two sections. First, we evaluate the performance of the newly proposed LSTM model and compare it with a previous work known as DeepDGA in terms of model accuracy [15]. In addition, we report how the model accuracy changes under tampering of input samples used to generate adversarial instances. Finally, we analyze the DGA classifier before and after adversarial augmentation of the training data.

At first, a binary classification, which simply predicts between DGA-based malicious or benign (Alexa top 135K) samples, is applied. Table II demonstrates a comparison between the performance of the DeepDGA model and our LSTM-based model. The detection rates of Cryptolocker and Dicrypt are higher with DeepDGA than with our DGA classifier for the available samples. On the other hand, Locky V2, Pykspa, Ramdo and Simda are detected with better accuracy by our classifier, and the remaining cases show the same detection rate for both. Even though the improvement was not substantial, the results could have turned out differently, since we used a different dataset than DeepDGA.

TABLE II: Deep-DGA and the Proposed Model Comparison
Family       | Deep-DGA | The Proposed DGA Detector
Corebot      | 1.0      | 1.0
Cryptolocker | –        | –
Pykspa       | 0.85     | –
Qakbot       | 0.99     | 0.99
Ramdo        | 0.99     | –
Ramnit       | 0.98     | 0.98
Simda        | 0.96     | –
Average      | 0.97     | –
A. LSTM Model Results
Table III shows the resulting detection rates of our model for the 68 DGA families. Our findings show that our method performed with high accuracy (usually above the 0.97 accuracy margin) for most of the DGA-based malware families.

We also evaluate the DGA classifier performance considering standard metrics such as precision, recall, F1-score, false positive rate (FPR), false negative rate (FNR) and area under the receiver operating characteristic curve (AUC). These evaluation metrics are widely used to measure the quality of an ML model. Using the proposed algorithm, we craft domain names from both benign (Alexa 135K) and DGA-based malicious samples. The changes in these evaluation metrics of the model for different epsilon values can be viewed in Table VI. Initially, we set the epsilon value to zero to observe the actual performance of the model. Our findings show that the model performs well in terms of the aforementioned metrics. In the case of the adversarial samples generated from malicious domain names, the accuracy rate of the LSTM model degrades with increasing epsilon value until it stabilizes at an equilibrium because, at that point, the model has been trained well enough to recognize malicious domains.
TABLE III: Detection rate of each DGA malware family using the LSTM model

Family        | Detection Rate | Number of Samples
Bamital       | –    | –
Banjori       | –    | –
Beebone       | –    | –
Blackhole     | –    | –
Bobax         | –    | –
Chir          | –    | –
Corebot       | –    | –
Darkshell     | –    | –
Dyre          | –    | –
Ebury         | –    | –
Emotet        | –    | –
Feodo         | –    | –
Gameover      | –    | –
Gameover P2P  | –    | –
Gspy          | –    | –
Infy          | –    | –
Modpack       | –    | –
Murofetweekly | –    | –
Murofet       | –    | –
Pandabanker   | –    | –
Ramdo         | –    | –
Ranbyus       | –    | –
Redyms        | –    | –
Rovnix        | –    | –
Sisron        | –    | –
Sutra         | –    | –
Tsifiri       | –    | –
Ud2           | –    | –
Ud3           | –    | –
Vidrotid      | –    | –
Wd            | –    | –
Xshellghost   | –    | –
Xxhex         | –    | –
Chinad        | 0.99 | 1129
Diamondfox    | 0.99 | 829
Locky V2      | 0.99 | 3549
Oderoor       | 0.99 | 1200
Padcrypt      | 0.99 | 1220
Qadars        | 0.99 | 2000
Qakbot        | 0.99 | 4000
Sphinx        | 0.99 | 2000
Tinba         | 0.99 | 1998
Cryptolocker  | 0.98 | 4129
Dircrypt      | 0.98 | 500
Ramnit        | 0.98 | 1200
Simda         | 0.98 | 2000
Szribi        | 0.98 | 1200
Volatilecedar | 0.98 | 498
Bedep         | 0.97 | 1028
Ekforward     | 0.97 | 578
Fobber        | 0.97 | 725
Pushdotid     | 0.97 | 900
Pykspa 2      | 0.97 | 1200
Urlzone       | 0.97 | 2000
Necurs        | 0.96 | 4201
Dnschanger    | 0.94 | 1228
Suppobox      | 0.94 | 12000
Tempedrevetdd | 0.94 | 1000
Proslikefan   | 0.93 | 1098
Torpig        | 0.93 | 1200
Vidro         | 0.93 | 1276
Mirai         | 0.92 | 500
Pykspa 2S     | 0.92 | 1200
Pykspa        | 0.90 | 1201
Ud4           | 0.90 | 100
Hesperbot     | 0.87 | 150
Shifu         | 0.87 | 2000
Nymaim        | 0.85 | 1200
TABLE IV: Detection rate of each DGA malware family before and after training data augmentation

Family        | Detection Rate with Injected Attack Samples (Epsilon = 11) | Detection Rate with Re-labelled Attack Samples | Improvement Rate (%)
Beebone       | –    | –    | –
Tsifiri       | –    | –    | –
Ramdo         | 0.04 | 1.00 | –
Sisron        | 0.04 | 1.00 | –
Redyms        | 0.07 | 1.00 | –
Ebury         | 0.05 | 0.97 | 92
Szribi        | 0.08 | 1.00 | –
Simda         | 0.08 | 0.99 | 91
Ud4           | 0.00 | 0.90 | 90
Ranbyus       | 0.14 | 1.00 | –
Vidrotid      | 0.14 | 1.00 | –
Pykspa 2      | 0.09 | 0.93 | 84
Ud3           | 0.17 | 1.00 | –
Cryptolocker  | 0.17 | 0.98 | 81
Fobber        | 0.19 | 1.00 | –
Pykspa 2S     | 0.17 | 0.96 | 79
Darkshell     | 0.04 | 0.80 | 76
Suppobox      | 0.22 | 0.98 | 76
Urlzone       | 0.30 | 1.00 | –
Corebot       | 0.32 | 1.00 | –
Locky V2      | 0.28 | 0.96 | 68
Necurs        | 0.31 | 0.98 | 67
Volatilecedar | 0.03 | 0.70 | 67
Vidro         | 0.31 | 0.95 | 64
Hesperbot     | 0.17 | 0.80 | 63
Padcrypt      | 0.35 | 0.98 | 63
Bedep         | 0.37 | 0.99 | 62
Emotet        | 0.38 | 1.00 | –
Pykspa        | 0.29 | 0.91 | 62
Tempedrevetdd | 0.35 | 0.96 | 61
Dnschanger    | 0.38 | 0.98 | 60
Oderoor       | 0.37 | 0.97 | 60
Proslikefan   | 0.35 | 0.94 | 59
Pushdotid     | 0.30 | 0.89 | 59
Qadars        | 0.42 | 1.00 | –
Sphinx        | 0.41 | 0.99 | 58
Bobax         | 0.43 | 1.00 | –
Diamondfox    | 0.31 | 0.84 | 53
Dircrypt      | 0.44 | 0.96 | 52
Feodo         | 0.48 | 1.00 | –
Sutra         | 0.49 | 1.00 | –
Ramnit        | 0.48 | 0.98 | 50
Qakbot        | 0.52 | 1.00 | –
Mirai         | 0.49 | 0.96 | 47
Modpack       | 0.53 | 1.00 | –
Shifu         | 0.43 | 0.89 | 46
Nymaim        | 0.42 | 0.87 | 45
Xshellghost   | 0.31 | 0.75 | 44
Torpig        | 0.39 | 0.79 | 40
Rovnix        | 0.64 | 1.00 | –
Blackhole     | 0.66 | 1.00 | –
Gameover      | 0.68 | 1.00 | –
Murofet       | 0.69 | 1.00 | –
Tinba         | 0.75 | 1.00 | –
Chinad        | 0.76 | 1.00 | –
Banjori       | 0.77 | 1.00 | –
Pandabanker   | 0.78 | 1.00 | –
Dyre          | 0.81 | 1.00 | –
Ekforward     | 0.82 | 1.00 | –
Gameover P2P  | 0.83 | 1.00 | –
Infy          | 0.84 | 1.00 | –
Xxhex         | 0.85 | 1.00 | –
Chir          | 0.98 | 1.00 | –
Gspy          | 0.98 | 1.00 | –
Bamital       | 0.99 | 1.00 | –
Murofetweekly | 0.99 | 1.00 | –
Ud2           | 0.99 | 1.00 | –
Wd            | 0.99 | 1.00 | –
TABLE V: Transformation of Alexa Domain Name Samples.
Malicious Domains
Epsilon | Accuracy | Precision | Recall | F1   | FPR  | FNR  | AUC
0       | 0.98     | 0.98      | 0.99   | 0.98 | 0.03 | 0.01 | 0.99
1       | 0.95     | 0.95      | 0.95   | 0.95 | 0.13 | 0.09 | 0.94

Benign Domains
Epsilon | Accuracy | Precision | Recall | F1   | FPR  | FNR  | AUC
0       | 0.98     | 0.98      | 0.97   | 0.97 | 0.01 | 0.02 | 0.98
0.7     | 0.70     | 0.69      | 0.65   | 0.65 | 0.40 | 0.15 | 0.70
TABLE VI: Performance of the DGA Classifier vs. Penetration Coefficient for Both Benign and Malicious Domains.

In addition, the dissimilarities between the benign class and the malicious class drastically increase. This indicates the limit of misclassification, even with increasing epsilon values.

We also consider benign instances as input, corrupting the benign samples to create adversarial domain names. Subtle perturbations do not decrease accuracy much, since the injected epsilon values do not manipulate the original data sufficiently to cause misclassification. When we continue to increase the penetration coefficient, which causes slight differences to the original benign data, the model fails to recognize these changes. Therefore, the model performance is dramatically impaired. As we further scale up the noise, the model starts to predict these drastic modifications more accurately, due to the severe degradation of the input. Table V shows how the domain name samples are transformed by the epsilon values.

It is noteworthy to observe the decreasing accuracy of our DGA classifier as we add perturbations, because adversarial examples mislead the model into making incorrect decisions that increase the number of false positives and false negatives. The adversarial samples created by injecting noise have the potential to deceive the LSTM model even more than the adversarial samples generated by a GAN. In the study of Anderson et al. [15], the detection rate of the model was found to be 48.0%, which means that they achieved an attack success rate of about 53%. Table VI shows instances of the LSTM model's accuracy around 45% with different perturbation coefficients. We achieved a highest attack success rate of 56%, which is higher than the GAN approach by 3%, indicating that our model-generated DGA families are able to deceive the ML model more effectively.
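For reference, the metrics reported in Table VI can be computed as sketched below with scikit-learn; the label and score arrays are dummy values, not the paper's data.

    import numpy as np
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, confusion_matrix)

    # Dummy ground-truth labels (1 = malicious) and model scores.
    y_true = np.array([1, 1, 0, 0, 1, 0])
    y_score = np.array([0.9, 0.4, 0.2, 0.1, 0.8, 0.6])
    y_pred = (y_score >= 0.5).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    metrics = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "AUC": roc_auc_score(y_true, y_score),
    }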
B. Improving the LSTM Model with Augmented Training Data
As discussed above, we are able to successfully produce adversarial domain names that can bypass detection by the LSTM model, and we show that successful augmentation of the training data samples can be done with our proposed method. Changes in penetration coefficients can impact the DGA classifier to different extents in terms of model accuracy. We then modify the dataset by injecting correctly labelled adversarial domains. We replace every malicious training sample with its adversarial counterpart, including the top Alexa 135K in the training set, and re-train the model. Table IV illustrates the differences before and after training with adversarial samples when the epsilon value is 11. Our reason for selecting the value 11 for epsilon is to illustrate the maximum damage to the well-trained LSTM model and how much better the model performs after training with augmented data. When the model is trained with adversarial samples, it is able to detect unseen malicious samples to a much larger extent. The hardened classifier increases the model's detection ability for each DGA family, as can be seen from Table IV. For some family groups, such as Bamital, Gspy, and Ud2, the adversarial manipulation did not have any significant impact on model accuracy (within 1%). However, for most others, training with augmented data boosted accuracy immensely, on some occasions reaching up to 100%. As a result, the model trained with adversarial samples has been shown to perform much more accurately, close to the performance of the model before adversarial manipulation.
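A minimal sketch of this augmentation-and-retraining step, reusing the hypothetical perturb_domain() and train() functions sketched earlier, is shown below; it is an assumption-laden illustration rather than the authors' pipeline.

    import torch

    def augment_and_retrain(model, inputs, labels, epsilon=11):
        # Craft a perturbed counterpart for each malicious training sample,
        # keep its correct (malicious) label, and retrain on the augmented set.
        adv_inputs = []
        for x, y in zip(inputs, labels):
            if y.item() == 1:  # malicious sample: replace with adversarial version
                adv = perturb_domain(model, x.unsqueeze(0), y.view(1, 1).float(), epsilon)
                adv_inputs.append(adv)
            else:              # benign samples are kept as-is
                adv_inputs.append(x)
        augmented = torch.stack(adv_inputs)
        return train(model, augmented, labels.float())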
VII. CONCLUSIONS
In this paper, we presented a novel detection system based on an LSTM model for the detection of both traditional and dictionary-based DGA-generated domains using a character-by-character approach. Our experimental findings show that the LSTM-based model can detect malicious domains with a trivial margin of error.

However, machine learning models are unable to learn the characteristic behaviors of DGA-based malicious domains if there are new or never-seen-before data in the testing dataset. In this study, we highlight this issue with adversarial manipulation using different data perturbation cases. According to our findings, newly generated domains produced by the proposed perturbation approach could not be detected by the DGA classifier. After we trained the model with the augmented training dataset, including adversarial samples, the experimental results show that the LSTM model was able to detect previously unobserved DGA families.

We store malicious domain names, detected by our model, in a Protected Spreadsheet Container with Data (PROSPECD). It provides data confidentiality and integrity, as well as origin integrity, role-based and attribute-based access control. PROSPECD protects the domain names in transit and at rest against adversarial access and modifications.
REFERENCES

[1] (2020, September) Domain generation algorithm. [Online]. Available: https://en.wikipedia.org/wiki/Domain_generation_algorithm
[2] S. Yadav, A. K. K. Reddy, A. Reddy, and S. Ranjan, "Detecting algorithmically generated malicious domain names," in Proc. of the 10th ACM SIGCOMM Conf. on Internet Measurement. ACM, 2010, pp. 48–61.
[3] J. Woodbridge, H. S. Anderson, A. Ahuja, and D. Grant, "Predicting domain generation algorithms with long short-term memory networks," arXiv preprint arXiv:1611.00791, 2016.
[4] I. Yilmaz, R. Masum, and A. Siraj, "Addressing imbalanced data problem with generative adversarial network for intrusion detection," IEEE Computer Society, 2020, pp. 25–30.
[5] I. Yilmaz, "Practical fast gradient sign attack against mammographic image classifier," arXiv preprint arXiv:2001.09610, 2020.
[6] D. Ulybyshev, C. Bare, K. Bellisario, V. Kholodilo, B. Northern, A. Solanki, and T. O'Donnell, "Protecting electronic health records in transit and at rest."
[9] M. Kührer et al., in Intl. Workshop on Recent Advances in Intrusion Detection. Springer, 2014, pp. 1–21.
[10] M. Ligh, S. Adair, B. Hartstein, and M. Richard, Malware Analyst's Cookbook and DVD: Tools and Techniques for Fighting Malicious Code. Wiley Publishing, 2010.
[11] J. Zhang, R. Perdisci, W. Lee, U. Sarfraz, and X. Luo, "Detecting stealthy P2P botnets using statistical traffic fingerprints," IEEE, 2011, pp. 121–132.
[12] T.-F. Yen and M. K. Reiter, "Are your hosts trading or plotting? Telling P2P file-sharing and bots apart," IEEE, 2010, pp. 241–252.
[13] J. Manni, A. Aziz, F. Gong, U. Loganathan, and M. Amin, "Network-based binary file extraction and analysis for malware detection," US Patent 8,935,779, Jan. 13, 2015.
[14] A. Aziz, H. Uyeno, J. Manni, A. Sukhera, and S. Staniford, "Electronic message analysis for malware detection," US Patent 10,027,690, Jul. 17, 2018.
[15] H. S. Anderson, J. Woodbridge, and B. Filar, "DeepDGA: Adversarially-tuned domain generation and detection," in Proc. of the 2016 ACM Workshop on Artificial Intelligence and Security. ACM, 2016, pp. 13–21.
[16] D. Berthelot, T. Schumm, and L. Metz, "BEGAN: Boundary equilibrium generative adversarial networks," arXiv preprint arXiv:1703.10717, 2017.
[17] I. Yilmaz and R. Masum, "Expansion of cyber attack data from unbalanced datasets using generative techniques," arXiv preprint arXiv:1912.04549, 2019.
[18] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[19] D. Plohmann, K. Yakdan, M. Klatt, J. Bader, and E. Gerhards-Padilla, "A comprehensive measurement study of domain generating malware," in USENIX Security Symp. (USENIX Security 16), 2016, pp. 263–278.
[20] P. Lison and V. Mavroeidis, "Automatic detection of malware-generated domains with recurrent neural models," arXiv preprint arXiv:1709.07102, 2017.
[21] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Beyond blacklists: Learning to detect malicious web sites from suspicious URLs," in Proc. of the 15th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining. ACM, 2009, pp. 1245–1254.
[22] B. Yu, J. Pan, J. Hu, A. Nascimento, and M. De Cock, "Character level based detection of DGA domain names," IEEE, 2018, pp. 1–8.
[23] D. Tran, H. Mac, V. Tong, H. A. Tran, and L. G. Nguyen, "A LSTM based framework for handling multiclass imbalance in DGA botnet detection," Neurocomputing, vol. 275, pp. 2401–2413, 2018.
[24] M. Pereira, S. Coleman, B. Yu, M. DeCock, and A. Nascimento, "Dictionary extraction and detection of algorithmically generated domain names in passive DNS traffic," in Intl. Symp. on Research in Attacks, Intrusions, and Defenses. Springer, 2018, pp. 295–314.
[25] B. Yu, D. L. Gray, J. Pan, M. De Cock, and A. C. Nascimento, "Inline DGA detection with deep networks," IEEE, 2017, pp. 683–692.
[26] R. R. Curtin, A. B. Gardner, S. Grzonkowski, A. Kleymenov, and A. Mosquera, "Detecting DGA domains with recurrent neural networks and side information," in Proc. of the 14th Intl. Conf. on Availability, Reliability and Security, 2019, pp. 1–10.
[27] Y. Zhauniarovich, I. Khalil, T. Yu, and M. Dacier, "A survey on malicious domains detection through DNS data analysis," ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1–36, 2018.
[28] L. Lilien and B. Bhargava, "A scheme for privacy-preserving data dissemination," IEEE Trans. on Systems, Man, and Cybernetics-Part A: Systems and Humans, 2009, pp. 202–213.
[32] R. Ranchal, "Cross-domain data dissemination and policy enforcement," 2015.
[33] C. N. Tun and K. T. Mya, "Secure spreadsheet data file transferring system," 5th Local Conf. on Parallel and Soft Computing, 2010.
[34] M. R. A. Mithu, V. Kholodilo, R. Manicavasagam, D. Ulybyshev, and M. Rogers, "Secure industrial control system with intrusion detection," in The 33rd Intl. FLAIRS Conf.
[37] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[38] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," Intl. Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
[39] F. A. Gers and E. Schmidhuber, "LSTM recurrent networks learn simple context-free and context-sensitive languages," IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1333–1340, 2001.
[40] (2020, September) Cosine similarity. [Online]. Available: https://en.wikipedia.org/wiki/Cosine_similarity
[41] N. Papernot, P. McDaniel, and I. Goodfellow, "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples," arXiv preprint arXiv:1605.07277, 2016.
[42] (2020, September) Majestic million. [Online]. Available: https://majestic.com/reports
[43] (2020, September) From research to production. [Online]. Available: https://pytorch.org/
[44] M. Patella and P. Ciaccia, "Approximate similarity search: A multi-faceted problem."