Improving DGA-Based Malicious Domain Classifiers for Malware Defense with Adversarial Machine Learning
Ibrahim Yilmaz, Ambareen Siraj, Denis Ulybyshev
Department of Computer Science, Tennessee Technological University
Cookeville, USA
yilmaz42, asiraj, [email protected]
Abstract—Domain Generation Algorithms (DGAs) are used by adversaries to establish Command and Control (C&C) server communications during cyber attacks. Blacklists of known/identified C&C domains are often used as one of the defense mechanisms. However, since blacklists are static and generated by signature-based approaches, they can neither keep up with nor detect never-seen-before malicious domain names. Due to this shortcoming of blacklist domain checking, machine learning algorithms have been used to address the problem to some extent. However, when training is performed with limited datasets, the algorithms are likely to fail in detecting new DGA variants. To mitigate this weakness, we successfully applied a DGA-based malicious domain classifier using the Long Short-Term Memory (LSTM) method with a novel feature engineering technique. Our model's performance shows a higher level of accuracy compared to a previously reported model from prior research. Additionally, we propose a new method using adversarial machine learning to generate never-before-seen malware-related domain families that can be used to illustrate the shortcomings of machine learning algorithms in this regard. Next, we augment the training dataset with these new samples so that training of the machine learning models becomes more effective in detecting never-before-seen malicious domain name variants. Finally, to protect blacklists of malicious domain names from disclosure and tampering, we devise secure data containers that store blacklists and guarantee their protection against adversarial access and modifications.
Index Terms—Domain Generation Algorithms, Adversarial Machine Learning, Long Short-Term Memory, Data Privacy
I. INTRODUCTION
Security researchers have developed many different defense mechanisms in order to protect computer systems against malware and malicious botnet C&C communications. Blacklists of malicious websites are one of the most commonly used defense mechanisms, where lists of domain names or IP addresses that are flagged as harmful are maintained. Any messages to/from these listed sites are feared to host potential C&C servers and hence blocked to prevent any further communications. As a counterattack, attackers developed DGAs as a measure to thwart blacklist detection [1].

In recent years, hacker communities have been utilizing DGAs as the primary mechanism to produce millions of malicious domain names automatically through pseudo-random domain names in a very short time period [2]. Subsets of these malicious domain names are utilized to map to the C&C servers. These dynamically created domain names successfully evade static blacklist-checking mechanisms. Additionally, as one domain gets recognized and blocked, the C&C server can easily switch to another one.

To overcome the limitations of static domain blacklists, machine learning (ML) techniques have been developed to detect malicious domain names, and these techniques have yielded mostly promising results [3], [4], [5]. However, ML models do not perform well with never-seen-before DGA families when an unrepresentative or imbalanced training dataset is used. To address this problem, we propose a novel approach to generate a rich set of training data representing malicious domain names using a data augmentation technique.

Data augmentation of an existing training dataset is one way to make ML models learn better and, as a result, perform more robustly. Nevertheless, classic data augmentation merely creates a restricted set of reasonable alternatives. In our approach, as illustrated in Figure 1, an adversarial machine learning technique is used to generate a diverse set of augmented data by means of data perturbation. The generated adversarial domain names are extremely difficult to differentiate from benign domain names. As a result, the machine learning classifier misclassifies malicious domains as benign ones. Afterwards, these adversarial examples are correctly re-labeled as malicious and reintroduced into the existing training dataset (see Figure 1). In this way, we augment the blacklist with diverse data to effectively train the machine learning models and increase the robustness of the DGA classifiers.

In addition, we devise a secure container to store and transfer the blacklists of malicious domain names in encrypted form as a Protected Spreadsheet Container with Data (PROSPECD), presented in [6]. PROSPECD provides confidentiality and integrity of the blacklists so that they can be used as training data to build a secure model. In addition to data integrity, PROSPECD provides origin integrity. This container protects the adversarial samples, used to teach the model, from unknown adversarial perturbations. The protected blacklist can be marketed commercially [7] to internet service providers and companies who need to maintain their own internal blacklists.

Fig. 1: Overview of the Proposed Methodology.

A. Our Contributions
Blacklisting is a security strategy that keeps network flows and computer environments secure [8]. Typical network traffic blacklists include malicious IP addresses or domain names, which are blocked from communication attempts in both directions. However, the coverage of blacklists is insufficient and unreliable because adversarial hacker communities can compromise a system by generating malware domains dynamically using DGAs that easily bypass the static blacklist. Kuhrer et al. [9] evaluated 15 public malware blacklists as well as 4 blacklists served by antivirus vendors. Their findings show that blacklists fail to protect systems against prevalent malware sites generated by DGAs.

In order to address this shortcoming, researchers have mostly proposed solutions based on reverse engineering techniques to identify and block bot malware [10]. However, such solutions are not always feasible due to the obfuscation of the underlying algorithms, as hackers adapt their algorithms swiftly to exploit the vulnerabilities in the system. Other alternative solutions require auxiliary contextual information. One of these alternative solutions focuses on network traffic analysis [11], [12] or broad network packet examination [13], [14]. However, these techniques may not be able to keep up with large-scale network traffic. Therefore, there is a need for a sophisticated network traffic analysis tool for effective blacklisting.

In response to the above issues, detection of malicious domains has increasingly evolved towards the usage of machine learning techniques. The performance of solutions proposed to automatically detect malicious domains mostly suffers on never-seen-before malicious domains. This is due to the lack of generalization when a model is not trained effectively with a representative or balanced training dataset. For this reason, blacklists must be constantly updated in order to identify and prevent DGA-generated malicious domain connections.

Data augmentation is an approach where more data is created from existing data in a way that can enhance the usefulness of the application. Anderson et al. [15] demonstrated how data can be augmented more effectively by using an adversarial machine learning technique. The researchers generated adversarial domain names using the Generative Adversarial Network (GAN) methodology. In their approach, two neural networks are trained simultaneously, and the classifier is later trained with a dataset which includes adversarial samples to harden the DGA classifier. However, the main drawback of this approach is the unpredictability of desirable results because of the difficulty of controlling both classifiers at the same time, even when a good optimization algorithm is used. As a result, it fails to always converge to a point of equilibrium to generate new domain names. Additionally, controlling the diversity of produced samples is challenging with GAN models [16], [17]. In such cases, the newly generated data do not add to the diversity of the current data. Hence, this solution alone cannot increase the malicious detection capabilities of blacklists against never-before-seen DGA families.

To improve the accuracy of such detection mechanisms, we propose a new technique based on data perturbation without relying on a fresh public blacklist or an external reputation database. In our approach, we observe how the model works and use that knowledge to mislead the DGA classifier.
To do this, noise, carefully calculated from that observation, is added to the DGA-based malicious domains so that they appear non-malicious. These adversarial samples are then predicted as benign by the machine learning (ML) model. Such an adversarial attack can be addressed with adversarial training [18]. Therefore, after correctly labeling these seemingly benign adversarial samples, the ML model is trained with the augmented dataset. The experimental results demonstrate that the retrained ML model is able to detect never-before-seen DGA malware better than other similar approaches.

Our work has the following contributions:
• Using a machine learning technique based on the Long Short-Term Memory (LSTM) model for automatic detection of malicious domains, with a DGA classifier that analyzes a massive labeled DGA dataset character by character.
• To the best of our knowledge, this is the first study to propose the generation of malicious domain names using a data perturbation approach in order to expand the training dataset with adversarial samples.
• Demonstrating that, as expected, the LSTM model fails to recognize newly introduced adversarial samples in the augmented training dataset.
• Applying adversarial training to train the model with correct labelling of the adversarial samples in the training dataset to increase the model's generalization ability.
• Demonstrating that the augmented training dataset can help the LSTM model to detect not only never-seen-before DGAs, but also novel DGA families.

The rest of this paper is organized as follows: the literature review in the context of our work is discussed in Section II. The necessary background for DGA-based malicious domain models is reviewed in Section III. The core design of our system, including the adversarial machine learning models to generate malicious domain names and the data containers to store them, is presented in Section IV. We discuss the implementation of the adversarial machine learning models in Section V. The evaluation results of our study are presented in Section VI. Section VII concludes the paper.

II. RELATED WORK
Domain Generation Algorithms and the detection of malicious domain names have been analyzed by different researchers for a number of years. Daniel et al. [19] presented a taxonomy of DGA types by analyzing the characteristics of 43 different DGA-based malware families and compared the properties of these families. They also implemented previous studies with 18 million DGA domains that were created to identify malicious domains. They reported further progress in DGA detection.

Detection of DGA botnets became feasible with the implementation of powerful machine learning models. Lison et al. [20] implemented a recurrent neural network model for the detection of DGAs. Their empirical study detected malicious domain names with high accuracy. Justin et al. [21] defined several models to detect malicious web sites, including a logistic regression, a support vector machine, and a Bayesian model. They used the DMOZ dataset for benign websites, while PhishTank and Spamscatter were used for malicious websites. Bin et al. [22] addressed the same issue using different machine learning classifiers for the detection of DGAs. They created a convolutional neural network (CNN) and a recurrent neural network (RNN) model for the classification of malicious and benign domain names. They compared the results of both models in terms of their performance and reported that both models performed comparably. Duc et al. [23] dealt with the multiclass imbalance problem of LSTM algorithms for the detection of malicious domain names generated by DGAs. The authors claimed that LSTM algorithms performed poorly with imbalanced datasets. To tackle this imbalanced dataset problem, they proposed a Long Short-Term Memory Multiclass Imbalance (LSTM.MI) algorithm and showed that their proposed algorithm provided more accurate results through different case studies.

In addition, Mayana et al. [24] introduced a WordGraph method to recognize dictionary-based malicious domains. The authors asserted that more sophisticated DGAs are able to avoid detection by conventional machine learning classifiers. They carried out their experiments by extracting dictionary information without using reverse engineering. Bin et al. [25] defined a deep neural network model as an inline DGA detector. They caution that most of the available datasets are not good representations of malicious domains or are outdated. Hence, machine learning models perform poorly when trained using such datasets. Furthermore, they explained that reverse engineering was a difficult method for training models. To tackle these problems, the researchers offered a novel detector for malicious domains without the need for reverse engineering. Their proposed technique was based on real traffic and reported to detect malicious domains in real time. Woodbridge et al. [3] created a machine learning classifier based on the LSTM network to detect malicious domain names in real time. Their classifier detected multiclass domains by categorizing them into particular malware families. The model predicted a domain name as malicious or benign based on the domain name alone, without any additional information.

However, although these studies achieved high detection rates for particular DGA families, the performance of machine learning based detection systems is poor on new DGA variants when the models are trained with unrepresentative or imbalanced training datasets. To handle this issue, Anderson et al. [15] offered a GAN algorithm to generate new domain names.
In their GAN approach, they implemented two different deep neural network classifiers named the discriminator and the generator. According to this GAN methodology, new malicious domain names are generated by the generator, which evades the discriminator's detection. Their case studies demonstrated that the new malicious domain names also bypass a random forest classifier. Once the model was trained with adversarial samples, it was hardened against new DGA families. However, the authors did not test it on DGA families created using a dictionary. Additionally, implementation of this approach is challenging due to the need to control two machine learning models, which might be unsuitable for detecting malware-related domain names. Unlike this approach, we propose to augment data by using an efficient data perturbation technique that generates hard-to-detect DGA families and identifies DGA types that are created either randomly or using a dictionary.

To overcome limitations of machine learning models in the aforementioned circumstances, Curtin et al. [26] proposed to combine a neural network model with supplementary domain registration information. This additional information, known as WHOIS data, helped the neural network model to efficiently identify the most difficult samples that were generated using English words. However, cybercriminals take advantage of bulk registration services by registering thousands of domain names in a short time, several months before the start of nefarious activities [27]. In addition, unauthorized people can access this information and falsify it by impersonating legitimate users. This makes the information questionable. Compared to this, our approach efficiently detects DGA families solely based on the domain names, without relying on any supplementary information.

The PROSPECD container, used to store and transfer blacklisted malicious domain names, is presented in [6]. Compared to the privacy-preserving data dissemination concept proposed by Lilien and Bhargava [28], it has the following features:
• Detection of several types of data leakages that can be made by authorized entities to unauthorized ones;
• Enforcement of access control policies either on a central server or locally on a client's side, in a Microsoft Excel® Add-in or in a cross-platform application [6];
• Container implementation as a digitally signed, watermarked, Microsoft Excel®-compatible spreadsheet file with hidden and encrypted data and access control policy worksheets;
• An on-the-fly key derivation mechanism for data worksheets.

(This paper is an independent publication and is neither affiliated with, nor authorized, sponsored, or approved by, Microsoft Corporation [29].)

The primary difference between PROSPECD and an Active Bundle [30], [31], [32] is that PROSPECD does not store an embedded policy enforcement engine (Virtual Machine). In contrast with a solution to encrypt the desired cells in a spreadsheet file, proposed by Tun and Mya in [33], in PROSPECD all the data worksheets are encrypted with separate keys, generated on-the-fly. PROSPECD supports role-based and attribute-based access control. Furthermore, digital and visual watermarks are embedded in PROSPECD to enable detection of data leakages. A Secure Data Container, proposed in [34] to store device state information, only supports centralized policy enforcement. PROSPECD supports both centralized and local policy enforcement mechanisms [6].

III. BACKGROUND
In this section, we review background information related to our research.

Domain Generation Algorithm (DGA): Domain generation algorithms are a primary means of connecting various families of malware with new or never-before-seen domains to avoid detection. There are many such DGA-based malware families (malware that connect to DGA-generated domain names). According to a study, the five most known families are Conficker, Murofet, BankPatch, Bonnana, and Bobax [35]. Although many DGA-based domain names are produced randomly, some are generated using a dictionary. The detection of these types of domain names is more difficult because of their similarity to legitimate domains.

Gradient Descent Algorithm: The gradient descent algorithm is the most popular optimization method used by machine learning classifiers to minimize errors. It takes into account the first derivative when modifying all parameters under consideration [36]. Gradient descent always strives to find the most appropriate way to minimize errors. The learning process starts with randomly produced weight values. Most of the time, these values are set to an initial value of zero and are used to calculate the loss function value. The gradient descent algorithm is then used to find a way to reduce the loss function. All weights are updated through the backpropagation process based on the gradient descent algorithm. We generate new adversarial domain names in our data augmentation method by utilizing and modifying the gradient descent algorithm's behavior. In Section V, how such adversarial samples are created is discussed in detail.

Long Short-Term Memory (LSTM) Model: The LSTM model, a specialized Recurrent Neural Network (RNN), is used in our approach for automatic detection of malicious domains. RNNs are supervised machine learning models that are commonly used to handle the processing of sequential data [37]. An RNN takes the previous and current inputs into account, while traditional networks consider all inputs independently. In our study, we implement the model on a character-by-character basis, so that it captures the importance of the order of the characters' occurrence in the word. Essentially, the model learns the occurrences of the characters in a sequential way. For example, for the domain name google, without the top-level domain, the model first learns the character g, then o, predicting that it succeeds g. Traditional neural networks, however, do not take the position of the characters into account. In addition, traditional neural networks process fixed-size input and produce fixed-size output, whereas an RNN does not have such limitations.

When input data such as domain names contain long sequences, a traditional RNN struggles to learn such long data dependencies, which is known as the vanishing gradient problem [38]. In order to avoid this problem, LSTM, a special kind of RNN, was introduced in [37]. LSTM relies on gating mechanisms, where information can be read, written, or erased via a set of programmable gates. This allows recurrent nets to keep track of information over many time steps and gain the ability to preserve long-term dependencies. For example, for a malicious domain name that is 32 characters long, LSTM keeps track of the relevant information about these characters throughout the process. Furthermore, a study has shown that LSTMs outperform previous RNNs for the solution of both context-free language (CFL) and regular language problems [39]. Researchers have reported that LSTMs generalize better and faster, leading to the creation of more effective models. In our study, the LSTM model learns how to detect DGA-based malicious domains in a similar way to what is mentioned above.
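As an illustration of the character-by-character processing described above, the following minimal PyTorch sketch builds a small character-level LSTM classifier; the class name, layer sizes, and hidden dimension are our own illustrative assumptions rather than the exact architecture detailed in Section V.

    import torch
    import torch.nn as nn

    # Minimal character-level LSTM classifier sketch (illustrative only).
    # Sizes follow the paper's description: 256 possible ASCII characters
    # and 256-dimensional character embeddings, with two LSTM layers.
    class CharLSTMClassifier(nn.Module):
        def __init__(self, vocab_size=256, embed_dim=256, hidden_dim=128, num_layers=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                                batch_first=True, dropout=0.5)
            self.fc = nn.Linear(hidden_dim, 1)  # benign vs. DGA (binary)

        def forward(self, char_ids):
            # char_ids: (batch, sequence_length) of character codes, consumed
            # one character position at a time by the LSTM.
            embedded = self.embedding(char_ids)        # (batch, seq, embed_dim)
            _, (hidden, _) = self.lstm(embedded)       # last hidden state summarizes the name
            return torch.sigmoid(self.fc(hidden[-1]))  # probability that the domain is malicious

    # Example: encode the domain "google" (without the TLD) as ASCII codes.
    domain = "google"
    char_ids = torch.tensor([[ord(c) for c in domain]])
    score = CharLSTMClassifier()(char_ids)  # untrained model, so the score is meaningless here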
IV. CORE DESIGN

A. Generating New Malicious Domain Names
Machine learning models' performance substantially relies on the training dataset, which is crucial for building effective classifiers. However, one of the biggest challenges with any ML model is accumulating a training dataset that is representative and balanced enough to enable the creation of an effective machine learning model. This process might be costly, time-consuming, or both. A restrictive training dataset can lead to poor performance of the ML model, and that is the primary reason DGA classifiers do not work well for automated malicious domain name detection. With traditional blacklists used in training, ML classifiers cannot detect never-before-seen DGA families. The model needs to be readjusted constantly with new variations of training data for effective threat detection. To address this issue, we propose to create a blacklist of domain names generated using a novel adversarial machine learning technique.

Our adversarial approach is based on data perturbation techniques inspired by [18], where domain names are perturbed along the gradient of a targeted DGA classifier with respect to the classifier loss. Moving the input along the gradient so that the loss function is maximized instead of minimized can mislead malware detection classifiers. Even though these domains are malicious, the DGA classifier, based on the LSTM model, predicts them as benign. As a result, new adversarial domains are generated that appear benign and do not match the blacklist data. Our method is formalized below [18].

Let x be a given malicious domain name, y its label (malicious), and M a DGA classifier such that M(x): x → y. Let x̂ be the domain name crafted using our adversarial attack and ŷ the class label such that M(x̂): x̂ → ŷ. The objective is

\max \; l(M, \hat{x}, y) \quad (1)
y \neq \hat{y} \quad (2)
\text{subject to } \hat{x} = x + \delta_x \quad (3)

Here, l(M, x, y) is the loss function of the DGA classifier in (1). A newly created adversarial domain name is predicted as benign by the DGA classifier in (2). In (3), \delta_x represents the perturbation added to the vector form of the given domain name. \delta_x is calculated as follows [18]:

\delta_x = \epsilon \cdot \mathrm{sign}(\nabla_x \, l(M, x, y)) \quad (4)

Here, \mathrm{sign}(\nabla_x \, l(M, x, y)) represents the direction in which the loss function of the DGA classifier increases, and \epsilon controls the magnitude of the noise in (4). A smaller epsilon value perturbs the original feature vector slightly, while a larger one perturbs it significantly and misleads the DGA classifier to a greater extent. On the other hand, a larger perturbation can be more easily detected than a smaller one by human eyes.

The calculated noise is added to each character embedding of the input data. The resulting noisy embeddings are compared to every character's embedding using cosine similarity [40] to measure their distance, and the final character is chosen based on this operation. The design of the model is demonstrated in Figure 2. In the example in the figure, the character g turns into c through our use of adversarial learning. Our adversarial domain name generation algorithm is summarized in Algorithm 1.

DGA detectors can be seen as black-box devices in real-world settings, since, in the black-box scenario, an adversary does not have any knowledge about the inner workings of the target model.
Nevertheless, for the sake of simplicity, we implement our proposed technique under the white-box assumption, where we obtain the optimum perturbation by accessing the target model so that we can compute gradients. Although the black-box assumption can be perceived as more realistic for this work, it is important to keep in mind that previous studies showed that adversarial samples have the transferability property [41]. This means that an adversarial example generated for one DGA model is likely to be misclassified by another DGA detector as well, since ML models trained with similar datasets from the same source learn similar decision boundaries. We leave testing adversarial examples under black-box settings for future work.
Fig. 2: Generation of Adversarial Samples Using Character-Level Transformations.
ALGORITHM 1:
Pseudocode of our proposed adversarial domain name generation approach.

Input: {X_i, Y_i}, where X_i is each domain name and Y_i is the corresponding ground-truth label;
X = (x_1 || x_2 || ... || x_n), where x_i is each character of a given input domain name;
training iteration number N_itr, number of adversarial examples N_adv, number of training samples N_train, number of characters of a given input domain name N_total.

Function generate_domain_names({X_j, Y_j}):
  for iteration = 0, ..., N_itr do
    train the DGA classifier M on the N_train training samples
  end
  for iteration = 0, ..., N_adv do
    for i = 0, ..., N_total do
      δ_{x_i} = ε × sign(∇_{x_i} l(M, x_i, y_i))
      z_i = x_i + δ_{x_i}
      x̂_i = character whose embedding x maximizes the cosine similarity (z_i · x) / (‖z_i‖ ‖x‖)
    end
    Output: X̂ = (x̂_1 || x̂_2 || ... || x̂_n)
  end
end
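The following is a minimal, illustrative PyTorch sketch of the perturbation step in Algorithm 1, reusing the hypothetical CharLSTMClassifier sketched in Section III; the function name, tensor shapes, and the use of binary cross-entropy with logits are assumptions for illustration, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def perturb_domain(model, char_ids, label, epsilon):
        # Illustrative sketch: add epsilon * sign(gradient) to each character
        # embedding, then snap back to the closest real character by cosine
        # similarity (names and shapes are our own assumptions).
        embedding_matrix = model.embedding.weight            # (vocab, embed_dim)
        embedded = model.embedding(char_ids).detach()
        embedded.requires_grad_(True)

        # Forward pass on the embedded characters and loss w.r.t. the true label.
        _, (hidden, _) = model.lstm(embedded)
        logit = model.fc(hidden[-1])
        loss = F.binary_cross_entropy_with_logits(logit, label)
        loss.backward()

        # FGSM-style perturbation in embedding space, as in Equation (4).
        noisy = embedded + epsilon * embedded.grad.sign()

        # Map each noisy embedding to the most similar character embedding.
        sims = F.cosine_similarity(noisy.squeeze(0).unsqueeze(1),   # (seq, 1, dim)
                                   embedding_matrix.unsqueeze(0),   # (1, vocab, dim)
                                   dim=-1)                          # (seq, vocab)
        return sims.argmax(dim=-1)                                  # adversarial character codes

The ε parameter mirrors Equation (4): it controls how far each character embedding moves before it is snapped back to the nearest real character by cosine similarity.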
B. Protected Spreadsheet Container with Data (PROSPECD) for Domain Name Blacklists

We propose to use a PROSPECD data container, presented in [6], to securely store and transfer blacklisted malicious domain names. In our use case, PROSPECD, implemented as an encrypted and digitally signed spreadsheet file, contains the following watermarked data worksheets:
• "Domain Blacklist" to store encrypted malicious domain names, detected by our classifier;
• "Metadata" to store encrypted metadata, which include access control policies;
• "General Info" to store encrypted information about the classifier used to detect the malicious domain names and its execution details.

PROSPECD provides data confidentiality and integrity, origin integrity, role-based and attribute-based access control, and centralized and decentralized enforcement of access control policies. Digital and visual watermarks, embedded into a PROSPECD spreadsheet file, enable detection of several types of data leakages that can be made behind-the-scenes by authorized parties to unauthorized ones [6].

PROSPECD Generator.
The malicious domain names classifier runs on a trusted server. Once the blacklist of domain names is generated, a dedicated process writes it, as well as the relevant information, to a spreadsheet file. Then the PROSPECD generator, currently implemented as a command-line utility, is called. It takes as input a spreadsheet file with the "Domain Blacklist" worksheet and two other worksheets ("General Info" and "Metadata") in plaintext form, and generates a separate spreadsheet file with encrypted worksheets. Each worksheet is encrypted with a separate symmetric 256-bit AES key, generated on-the-fly [6].
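For illustration only, the sketch below shows one way to derive a per-worksheet 256-bit AES key on the fly and encrypt a worksheet body in Python; the function name, the use of HKDF, and AES-GCM are our assumptions, and PROSPECD's actual container format and key-derivation scheme are described in [6].

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.hazmat.primitives.kdf.hkdf import HKDF
    from cryptography.hazmat.primitives import hashes

    def encrypt_worksheet(master_secret, worksheet_name, worksheet_bytes):
        # Illustrative only: derive a per-worksheet 256-bit key bound to the
        # worksheet name, then encrypt the worksheet contents with AES-GCM.
        key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                   info=worksheet_name.encode()).derive(master_secret)
        nonce = os.urandom(12)
        ciphertext = AESGCM(key).encrypt(nonce, worksheet_bytes, None)
        return nonce, ciphertext

    # Example: seal the "Domain Blacklist" worksheet body (placeholder domains).
    master_secret = os.urandom(32)
    blacklist = "\n".join(["exampledga1.com", "exampledga2.net"]).encode()
    nonce, sealed = encrypt_worksheet(master_secret, "Domain Blacklist", blacklist)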
PROSPECD Data Access on a Trusted Server.
The PROSPECD container, stored on a trusted server, can be accessed remotely from a web viewer. The client opens the Authentication Server (AS)'s URL in a web browser, selects the data subset to retrieve ("Domain Blacklist", "General Info" or "All") and enters their credentials: username (role) and password. The accessible worksheets from PROSPECD are decrypted, using the on-the-fly AES key derivation scheme, based on the client's role and attributes. These attributes include the versions of the web browser and operating system, as well as the type of device the client uses. Decrypted worksheets are sent to the client as a JSON object over an HTTPS communication channel [6]. PROSPECD supports a RESTful API. Table I shows the access control policies. The role "User" can only access blacklisted domain names from the "Domain Blacklist" worksheet. The role "Administrator" is allowed to access all the worksheets and also to download the PROSPECD file from the server to their local device, to access the data locally or transfer it to other parties.
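A minimal sketch of the Table I policy evaluation, assuming a simple role-to-worksheet mapping (this is not PROSPECD's actual policy engine):

    # Decide which worksheets a client may have decrypted, based on their role.
    ACCESS_POLICIES = {
        "Administrator": {"Domain Blacklist", "General Info", "Metadata"},
        "User": {"Domain Blacklist"},
    }

    def accessible_worksheets(role, requested):
        allowed = ACCESS_POLICIES.get(role, set())
        return requested & allowed

    # A "User" asking for everything is granted only the blacklist worksheet.
    print(accessible_worksheets("User", {"Domain Blacklist", "General Info", "Metadata"}))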
Local PROSPECD Data Access.
Authorized parties can access PROSPECD data locally, either from the Microsoft Excel® Add-in or from the standalone cross-platform application.

TABLE I: PROSPECD Access Control Policies
Role          | Domain Blacklist | General Info | Metadata
Administrator | YES              | YES          | YES
User          | YES              | NO           | NO
For the first option, a user needs to "download and install the Microsoft® Excel Add-in" [6], written in C#, open a PROSPECD container, and enter valid credentials. Then PROSPECD's digital signature is verified. If it is valid, the "Metadata" worksheet is decrypted, and the access control policies stored in this worksheet, shown in Table I, are evaluated. Then the decryption keys for the accessible data worksheets are derived [6]. To prevent unauthorized data disclosures, the authenticated user is not allowed to print or save the opened spreadsheet file once the Add-in has been launched. When the user closes the application, all the data are encrypted back to their original values and the visibility of all the worksheets is reset back to VeryHidden, after which the application closes. In addition to the Microsoft® Excel Add-in, a cross-platform application was developed to view PROSPECD data [6]. This application provides a graphical user interface and does not allow the user to store decrypted PROSPECD files locally, to prevent possible data leakages.
V. EXPERIMENTAL METHODOLOGY
This section describes the dataset used to build a DGA classifier based on an LSTM model, along with an explanation of the model implementation.
A. Dataset
The experimental dataset includes one non-DGA class (benign domain names) and 68 DGA families (malicious domain names). Data is collected from two different, publicly available sources [42], [19].

For benign domains, we use the Majestic top 1 million dataset [42]. This dataset includes the top one million website domain names worldwide and is updated daily. For malicious domains, we obtain data from DGArchive, a web repository of DGA-based malware families [19]. This repository has over 18 million DGA domains. We have worked with 68 DGA malware families, some generated by traditional DGAs and the remainder produced by dictionary DGAs. We used both traditional and dictionary-based DGAs, with over a hundred thousand malicious domains in total. To ensure a fair comparison, we used a subset of 135K samples from the Majestic top 1 million dataset so that the classifier is not biased towards a majority class, thus preventing overfitting.
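A minimal sketch of how such a labeled, balanced dataset could be assembled is shown below; the file names are hypothetical placeholders for the Majestic Million [42] and DGArchive [19] exports.

    import random

    def load_domains(path):
        with open(path) as f:
            return [line.strip().lower() for line in f if line.strip()]

    benign = load_domains("majestic_million.txt")        # placeholder file name
    malicious = load_domains("dgarchive_domains.txt")    # placeholder file name

    # Subsample 135K benign domains so the two classes stay balanced.
    random.seed(42)
    benign = random.sample(benign, 135_000)

    dataset = [(d, 0) for d in benign] + [(d, 1) for d in malicious]
    random.shuffle(dataset)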
B. ML Model Implementation
We implement our LSTM model in Python using PyTorch [43]. We use the LSTM at the character level with character embeddings (vector forms of the characters). This means that every character is encoded as a representative vector. We convert from the word space to a vector space to extract better features for the machine learning classifier. Each ASCII character (256 in total) is represented by a vector whose size is set to 256. In this way, we create a 256-by-256 embedding matrix where each row represents a character and character embeddings are represented by column vectors of the embedding matrix. Once the perturbation technique is employed, the embedded vectors of the characters of malicious domains are transformed back into word space.

We divided the dataset into training and test data. We use 90% of the dataset for training and the remainder is reserved for testing. The training set is used by the model to learn detection of DGA-based malicious domains. In our implementation, only the domain names are considered by the model, and the characters are pulled from the domain names character by character. At each time interval, one character's corresponding vector is fed into the LSTM model. The character embeddings are randomly initialized at the beginning. The model is able to learn the dependencies the characters have with each other and the conditional probabilities of the aforementioned characters. Thus, each character's embedding is learned by the LSTM model itself and the matrix is filled with these embeddings. In the test phase, the unseen data is predicted by the model as malicious or benign. The model's performance is analyzed in Section VI in detail.

In this work, our main goal is to augment the training dataset to increase the model's resiliency and improve its performance in detecting never-before-seen or yet-to-be-observed DGA families. To do this, an optimally calculated noise is added to each character embedding of the input data by the data perturbation technique. The newly created embedding with the added noise may not correspond to any character. Therefore, the model looks for the closest character embedding and assigns that character to the corresponding embedding. Here we use approximate similarity search [44] by applying cosine similarity [40], which takes dot products between the newly created embedding and each row of the embedding matrix to calculate the similarity. The new character is assigned from the row yielding the maximum similarity value.

Technically speaking, the LSTM model consists of two hidden layers along with the input and output layers. Dropout, a known regularization technique, is used with a rate of 0.5 in order to avoid overfitting. The fine-tuned parameters are found by using a batch size of 128 and a learning rate of 0.001 along with an epoch number of 6. With learning rates lower than this, more iterations may be needed. In addition, we use the Adam optimization algorithm, an extension to stochastic gradient descent, to minimize the error by adjusting network hyperparameters in an iterative way. Furthermore, binary cross-entropy, used for binary classification, is the loss function that measures the cost over the probability distribution of malicious and benign domain names.
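The following sketch shows a training loop with the hyperparameters listed above (batch size 128, learning rate 0.001, 6 epochs, Adam, binary cross-entropy); it assumes a model such as the illustrative CharLSTMClassifier from Section III and is not the authors' exact code.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train(model, inputs, labels, epochs=6, batch_size=128, lr=0.001):
        # inputs: LongTensor (N, seq_len) of padded character codes,
        # labels: FloatTensor (N,) of 0.0 (benign) / 1.0 (malicious).
        loader = DataLoader(TensorDataset(inputs, labels),
                            batch_size=batch_size, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.BCELoss()  # model outputs sigmoid probabilities
        for _ in range(epochs):
            for batch_x, batch_y in loader:
                optimizer.zero_grad()
                loss = criterion(model(batch_x).squeeze(1), batch_y)
                loss.backward()
                optimizer.step()
        return model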
VI. EVALUATION
The results of our experiments are divided into two sections. First, we evaluate the performance of the newly proposed LSTM model and compare it with a previous work known as DeepDGA in terms of model accuracy [15]. In addition, we report how the model accuracy changes under tampering of input samples used to generate adversarial instances. Finally, we analyze the DGA classifier before and after adversarial augmentation of the training data.

At first, a binary classification, which simply predicts between DGA-based malicious or benign (Alexa top 135K) samples, is applied. Table II demonstrates a comparison between the performance of the DeepDGA model and our LSTM-based model. The detection rates of Cryptolocker and Dicrypt are higher with DeepDGA than with our DGA classifier for the available samples. On the other hand, Locky V2, Pykspa, Ramdo and Simda are detected with better accuracy by our classifier, and the remaining cases show the same detection rate for both. Even though the improvement was not substantial, the results could have turned out differently, since we used a different dataset than DeepDGA.

TABLE II: Deep-DGA and the Proposed Model Comparison
Family       | Deep-DGA | The Proposed DGA Detector
Corebot      | 1.0      | 1.0
Cryptolocker | –        | –
Pykspa       | 0.85     | –
Qakbot       | 0.99     | 0.99
Ramdo        | 0.99     | –
Ramnit       | 0.98     | 0.98
Simda        | 0.96     | –
Average      | 0.97     | –
A. LSTM Model Results
Table III shows the resulting detection rates of our model for the 68 DGA families. Our findings show that our method performed with high accuracy (usually above the 0.97 accuracy margin) for most of the DGA-based malware families.

We also evaluate the DGA classifier performance considering standard metrics such as precision, recall, F1-score, false positive rate (FPR), false negative rate (FNR) and area under the receiver operating characteristic curve (AUC). These evaluation metrics are widely used to measure the quality of an ML model. Using the proposed algorithm, we craft domain names from both benign (Alexa 135K) and DGA-based malicious samples. The changes in these evaluation metrics of the model for different epsilon values can be viewed in Table VI. Initially, we set the epsilon value to zero to observe the actual performance of the model. Our findings show that the model performs well in terms of the aforementioned metrics. In the case of the adversarial samples generated from malicious domain names, the accuracy rate of the LSTM model degrades with increasing epsilon value until it stabilizes at an equilibrium because, at that point, the model has been trained well enough to recognize malicious domains.
TABLE III: Detection rate of each DGA malware family using the LSTM model

Family        | Detection Rate | Number of Samples
Bamital       | –    | –
Banjori       | –    | –
Beebone       | –    | –
Blackhole     | –    | –
Bobax         | –    | –
Chir          | –    | –
Corebot       | –    | –
Darkshell     | –    | –
Dyre          | –    | –
Ebury         | –    | –
Emotet        | –    | –
Feodo         | –    | –
Gameover      | –    | –
Gameover P2P  | –    | –
Gspy          | –    | –
Infy          | –    | –
Modpack       | –    | –
Murofetweekly | –    | –
Murofet       | –    | –
Pandabanker   | –    | –
Ramdo         | –    | –
Ranbyus       | –    | –
Redyms        | –    | –
Rovnix        | –    | –
Sisron        | –    | –
Sutra         | –    | –
Tsifiri       | –    | –
Ud2           | –    | –
Ud3           | –    | –
Vidrotid      | –    | –
Wd            | –    | –
Xshellghost   | –    | –
Xxhex         | –    | –
Chinad        | 0.99 | 1129
Diamondfox    | 0.99 | 829
Locky V2      | 0.99 | 3549
Oderoor       | 0.99 | 1200
Padcrypt      | 0.99 | 1220
Qadars        | 0.99 | 2000
Qakbot        | 0.99 | 4000
Sphinx        | 0.99 | 2000
Tinba         | 0.99 | 1998
Cryptolocker  | 0.98 | 4129
Dircrypt      | 0.98 | 500
Ramnit        | 0.98 | 1200
Simda         | 0.98 | 2000
Szribi        | 0.98 | 1200
Volatilecedar | 0.98 | 498
Bedep         | 0.97 | 1028
Ekforward     | 0.97 | 578
Fobber        | 0.97 | 725
Pushdotid     | 0.97 | 900
Pykspa 2      | 0.97 | 1200
Urlzone       | 0.97 | 2000
Necurs        | 0.96 | 4201
Dnschanger    | 0.94 | 1228
Suppobox      | 0.94 | 12000
Tempedrevetdd | 0.94 | 1000
Proslikefan   | 0.93 | 1098
Torpig        | 0.93 | 1200
Vidro         | 0.93 | 1276
Mirai         | 0.92 | 500
Pykspa 2S     | 0.92 | 1200
Pykspa        | 0.90 | 1201
Ud4           | 0.90 | 100
Hesperbot     | 0.87 | 150
Shifu         | 0.87 | 2000
Nymaim        | 0.85 | 1200
TABLE IV: Detection rate of each DGA malware family before and after training data augmentation

Family        | Detection Rate with Injected Attack Samples (Epsilon = 11) | Detection Rate with Re-labelled Attack Samples | Improvement Rate (%)
Beebone       | –    | –    | –
Tsifiri       | –    | –    | –
Ramdo         | 0.04 | 1.00 | –
Sisron        | 0.04 | 1.00 | –
Redyms        | 0.07 | 1.00 | –
Ebury         | 0.05 | 0.97 | 92
Szribi        | 0.08 | 1.00 | –
Simda         | 0.08 | 0.99 | 91
Ud4           | 0.00 | 0.90 | 90
Ranbyus       | 0.14 | 1.00 | –
Vidrotid      | 0.14 | 1.00 | –
Pykspa 2      | 0.09 | 0.93 | 84
Ud3           | 0.17 | 1.00 | –
Cryptolocker  | 0.17 | 0.98 | 81
Fobber        | 0.19 | 1.00 | –
Pykspa 2S     | 0.17 | 0.96 | 79
Darkshell     | 0.04 | 0.80 | 76
Suppobox      | 0.22 | 0.98 | 76
Urlzone       | 0.30 | 1.00 | –
Corebot       | 0.32 | 1.00 | –
Locky V2      | 0.28 | 0.96 | 68
Necurs        | 0.31 | 0.98 | 67
Volatilecedar | 0.03 | 0.70 | 67
Vidro         | 0.31 | 0.95 | 64
Hesperbot     | 0.17 | 0.80 | 63
Padcrypt      | 0.35 | 0.98 | 63
Bedep         | 0.37 | 0.99 | 62
Emotet        | 0.38 | 1.00 | –
Pykspa        | 0.29 | 0.91 | 62
Tempedrevetdd | 0.35 | 0.96 | 61
Dnschanger    | 0.38 | 0.98 | 60
Oderoor       | 0.37 | 0.97 | 60
Proslikefan   | 0.35 | 0.94 | 59
Pushdotid     | 0.30 | 0.89 | 59
Qadars        | 0.42 | 1.00 | –
Sphinx        | 0.41 | 0.99 | 58
Bobax         | 0.43 | 1.00 | –
Diamondfox    | 0.31 | 0.84 | 53
Dircrypt      | 0.44 | 0.96 | 52
Feodo         | 0.48 | 1.00 | –
Sutra         | 0.49 | 1.00 | –
Ramnit        | 0.48 | 0.98 | 50
Qakbot        | 0.52 | 1.00 | –
Mirai         | 0.49 | 0.96 | 47
Modpack       | 0.53 | 1.00 | –
Shifu         | 0.43 | 0.89 | 46
Nymaim        | 0.42 | 0.87 | 45
Xshellghost   | 0.31 | 0.75 | 44
Torpig        | 0.39 | 0.79 | 40
Rovnix        | 0.64 | 1.00 | –
Blackhole     | 0.66 | 1.00 | –
Gameover      | 0.68 | 1.00 | –
Murofet       | 0.69 | 1.00 | –
Tinba         | 0.75 | 1.00 | –
Chinad        | 0.76 | 1.00 | –
Banjori       | 0.77 | 1.00 | –
Pandabanker   | 0.78 | 1.00 | –
Dyre          | 0.81 | 1.00 | –
Ekforward     | 0.82 | 1.00 | –
Gameover P2P  | 0.83 | 1.00 | –
Infy          | 0.84 | 1.00 | –
Xxhex         | 0.85 | 1.00 | –
Chir          | 0.98 | 1.00 | –
Gspy          | 0.98 | 1.00 | –
Bamital       | 0.99 | 1.00 | –
Murofetweekly | 0.99 | 1.00 | –
Ud2           | 0.99 | 1.00 | –
Wd            | 0.99 | 1.00 | –
TABLE V: Transformation of Alexa Domain Name Samples.
Malicious Domains
Epsilon | Accuracy | Precision | Recall | F1   | FPR  | FNR  | AUC
0       | 0.98     | 0.98      | 0.99   | 0.98 | 0.03 | 0.01 | 0.99
1       | 0.95     | 0.95      | 0.95   | 0.95 | 0.13 | 0.09 | 0.94

Benign Domains
Epsilon | Accuracy | Precision | Recall | F1   | FPR  | FNR  | AUC
0       | 0.98     | 0.98      | 0.97   | 0.97 | 0.01 | 0.02 | 0.98
0.7     | 0.70     | 0.69      | 0.65   | 0.65 | 0.40 | 0.15 | 0.70
TABLE VI: Performance of the DGA Classifier vs. Penetration Coefficient for Both Benign and Malicious Domains.

In addition, the dissimilarities between the benign class and the malicious class drastically increase. This indicates the limit of misclassification, even with increasing epsilon values.

We also consider benign instances as input, corrupting the benign samples to create adversarial domain names. Subtle perturbations do not decrease accuracy much, since the injected epsilon values do not manipulate the original data sufficiently to cause misclassification. When we continue to increase the penetration coefficient, which causes slight differences to the original benign data, the model fails to recognize these changes. Therefore, the model performance is dramatically impaired. As we further scale up the noise, the model starts to predict these drastic modifications more accurately, due to the severe degradation of the input. Table V shows how the domain name samples are transformed by the epsilon values.

It is noteworthy to observe the decreasing accuracy of our DGA classifier as we add perturbations, because adversarial examples mislead the model into making incorrect decisions that increase the number of false positives and false negatives. The adversarial samples created by injecting noise have the potential to deceive the LSTM model even more than the adversarial samples generated by a GAN. In the study of Anderson et al. [15], the detection rate of the model was found to be 48.0%, which means that they achieved an attack success rate of about 53%. Table VI shows instances of the LSTM model's accuracy around 45% with different perturbation coefficients. We achieved a highest attack success rate of 56%, which is higher than the GAN approach by 3%, indicating that our model-generated DGA families are able to deceive the ML model more effectively.
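For reference, the metrics reported in Table VI can be computed as sketched below with scikit-learn; the label and score arrays are dummy values, not the paper's data.

    import numpy as np
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, confusion_matrix)

    # Dummy ground-truth labels (1 = malicious) and model scores.
    y_true = np.array([1, 1, 0, 0, 1, 0])
    y_score = np.array([0.9, 0.4, 0.2, 0.1, 0.8, 0.6])
    y_pred = (y_score >= 0.5).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    metrics = {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "AUC": roc_auc_score(y_true, y_score),
    }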
B. Improving the LSTM Model with Augmented Training Data
As discussed above, we are able to successfully produce adversarial domain names that can bypass detection by the LSTM model, and we show that successful augmentation of the training data samples can be done with our proposed method. Changes in penetration coefficients can impact the DGA classifier to different extents in terms of model accuracy. We then modify the dataset by injecting correctly labelled adversarial domains. We replace every malicious training sample with its adversarial counterpart, including the top Alexa 135K in the training set, and re-train the model. Table IV illustrates the differences before and after training with adversarial samples when the epsilon value is 11. Our reason for selecting the value 11 for epsilon is to illustrate the maximum damage to the well-trained LSTM model and how much better the model performs after training with augmented data. When the model is trained with adversarial samples, it is able to detect unseen malicious samples to a much larger extent. The hardened classifier increases the model's detection ability for each DGA family, as can be seen from Table IV. For some family groups, such as Bamital, Gspy, and Ud2, the adversarial manipulation did not have any significant impact on model accuracy (within 1%). However, for most others, training with augmented data boosted accuracy immensely, on some occasions reaching up to 100%. As a result, the model trained with adversarial samples has been shown to perform much more accurately, close to the performance of the model before adversarial manipulation.
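A minimal sketch of this augmentation-and-retraining step, reusing the hypothetical perturb_domain() and train() functions sketched earlier, is shown below; it is an assumption-laden illustration rather than the authors' pipeline.

    import torch

    def augment_and_retrain(model, inputs, labels, epsilon=11):
        # Craft a perturbed counterpart for each malicious training sample,
        # keep its correct (malicious) label, and retrain on the augmented set.
        adv_inputs = []
        for x, y in zip(inputs, labels):
            if y.item() == 1:  # malicious sample: replace with adversarial version
                adv = perturb_domain(model, x.unsqueeze(0), y.view(1, 1).float(), epsilon)
                adv_inputs.append(adv)
            else:              # benign samples are kept as-is
                adv_inputs.append(x)
        augmented = torch.stack(adv_inputs)
        return train(model, augmented, labels.float())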
VII. CONCLUSIONS
In this paper, we presented a novel detection system based on an LSTM model for the detection of both traditional and dictionary-based DGA-generated domains using a character-by-character approach. Our experimental findings show that the LSTM-based model can detect malicious domains with a trivial margin of error.

However, machine learning models are unable to learn the characteristic behaviors of DGA-based malicious domains if there are new or never-seen-before data in the testing dataset. In this study, we highlight this issue with adversarial manipulation using different data perturbation cases. According to our findings, newly generated domains produced by the proposed perturbation approach could not be detected by the DGA classifier. After we trained the model with the augmented training dataset, including adversarial samples, the experimental results show that the LSTM model was able to detect previously unobserved DGA families.

We store malicious domain names, detected by our model, in a Protected Spreadsheet Container with Data (PROSPECD). It provides data confidentiality and integrity, as well as origin integrity, role-based and attribute-based access control. PROSPECD protects the domain names in transit and at rest against adversarial access and modifications.
REFERENCES

[1] (2020, September) Domain generation algorithm. [Online]. Available: https://en.wikipedia.org/wiki/Domain_generation_algorithm
[2] S. Yadav, A. K. K. Reddy, A. Reddy, and S. Ranjan, "Detecting algorithmically generated malicious domain names," in Proc. of the 10th ACM SIGCOMM Conf. on Internet Measurement. ACM, 2010, pp. 48–61.
[3] J. Woodbridge, H. S. Anderson, A. Ahuja, and D. Grant, "Predicting domain generation algorithms with long short-term memory networks," arXiv preprint arXiv:1611.00791, 2016.
[4] I. Yilmaz, R. Masum, and A. Siraj, "Addressing imbalanced data problem with generative adversarial network for intrusion detection," IEEE Computer Society, 2020, pp. 25–30.
[5] I. Yilmaz, "Practical fast gradient sign attack against mammographic image classifier," arXiv preprint arXiv:2001.09610, 2020.
[6] D. Ulybyshev, C. Bare, K. Bellisario, V. Kholodilo, B. Northern, A. Solanki, and T. O'Donnell, "Protecting electronic health records in transit and at rest."
[9] M. Kührer et al., in Intl. Workshop on Recent Advances in Intrusion Detection. Springer, 2014, pp. 1–21.
[10] M. Ligh, S. Adair, B. Hartstein, and M. Richard, Malware Analyst's Cookbook and DVD: Tools and Techniques for Fighting Malicious Code. Wiley Publishing, 2010.
[11] J. Zhang, R. Perdisci, W. Lee, U. Sarfraz, and X. Luo, "Detecting stealthy P2P botnets using statistical traffic fingerprints," IEEE, 2011, pp. 121–132.
[12] T.-F. Yen and M. K. Reiter, "Are your hosts trading or plotting? Telling P2P file-sharing and bots apart," IEEE, 2010, pp. 241–252.
[13] J. Manni, A. Aziz, F. Gong, U. Loganathan, and M. Amin, "Network-based binary file extraction and analysis for malware detection," US Patent 8,935,779, Jan. 13, 2015.
[14] A. Aziz, H. Uyeno, J. Manni, A. Sukhera, and S. Staniford, "Electronic message analysis for malware detection," US Patent 10,027,690, Jul. 17, 2018.
[15] H. S. Anderson, J. Woodbridge, and B. Filar, "DeepDGA: Adversarially-tuned domain generation and detection," in Proc. of the 2016 ACM Workshop on Artificial Intelligence and Security. ACM, 2016, pp. 13–21.
[16] D. Berthelot, T. Schumm, and L. Metz, "BEGAN: Boundary equilibrium generative adversarial networks," arXiv preprint arXiv:1703.10717, 2017.
[17] I. Yilmaz and R. Masum, "Expansion of cyber attack data from unbalanced datasets using generative techniques," arXiv preprint arXiv:1912.04549, 2019.
[18] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[19] D. Plohmann, K. Yakdan, M. Klatt, J. Bader, and E. Gerhards-Padilla, "A comprehensive measurement study of domain generating malware," in USENIX Security Symp. (USENIX Security 16), 2016, pp. 263–278.
[20] P. Lison and V. Mavroeidis, "Automatic detection of malware-generated domains with recurrent neural models," arXiv preprint arXiv:1709.07102, 2017.
[21] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Beyond blacklists: Learning to detect malicious web sites from suspicious URLs," in Proc. of the 15th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining. ACM, 2009, pp. 1245–1254.
[22] B. Yu, J. Pan, J. Hu, A. Nascimento, and M. De Cock, "Character level based detection of DGA domain names," IEEE, 2018, pp. 1–8.
[23] D. Tran, H. Mac, V. Tong, H. A. Tran, and L. G. Nguyen, "A LSTM based framework for handling multiclass imbalance in DGA botnet detection," Neurocomputing, vol. 275, pp. 2401–2413, 2018.
[24] M. Pereira, S. Coleman, B. Yu, M. DeCock, and A. Nascimento, "Dictionary extraction and detection of algorithmically generated domain names in passive DNS traffic," in Intl. Symp. on Research in Attacks, Intrusions, and Defenses. Springer, 2018, pp. 295–314.
[25] B. Yu, D. L. Gray, J. Pan, M. De Cock, and A. C. Nascimento, "Inline DGA detection with deep networks," IEEE, 2017, pp. 683–692.
[26] R. R. Curtin, A. B. Gardner, S. Grzonkowski, A. Kleymenov, and A. Mosquera, "Detecting DGA domains with recurrent neural networks and side information," in Proc. of the 14th Intl. Conf. on Availability, Reliability and Security, 2019, pp. 1–10.
[27] Y. Zhauniarovich, I. Khalil, T. Yu, and M. Dacier, "A survey on malicious domains detection through DNS data analysis," ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1–36, 2018.
[28] L. Lilien and B. Bhargava, "A scheme for privacy-preserving data dissemination," IEEE Trans. on Systems, Man, and Cybernetics-Part A: Systems and Humans, 2009, pp. 202–213.
[32] R. Ranchal, "Cross-domain data dissemination and policy enforcement," 2015.
[33] C. N. Tun and K. T. Mya, "Secure spreadsheet data file transferring system," 5th Local Conf. on Parallel and Soft Computing, 2010.
[34] M. R. A. Mithu, V. Kholodilo, R. Manicavasagam, D. Ulybyshev, and M. Rogers, "Secure industrial control system with intrusion detection," in The 33rd Intl. FLAIRS Conf.
[37] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[38] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," Intl. Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
[39] F. A. Gers and E. Schmidhuber, "LSTM recurrent networks learn simple context-free and context-sensitive languages," IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1333–1340, 2001.
[40] (2020, September) Cosine similarity. [Online]. Available: https://en.wikipedia.org/wiki/Cosine_similarity
[41] N. Papernot, P. McDaniel, and I. Goodfellow, "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples," arXiv preprint arXiv:1605.07277, 2016.
[42] (2020, September) Majestic million. [Online]. Available: https://majestic.com/reports
[43] (2020, September) From research to production. [Online]. Available: https://pytorch.org/
[44] M. Patella and P. Ciaccia, "Approximate similarity search: A multi-faceted problem."