Domain-Embeddings Based DGA Detection with Incremental Training Method
1st Xin Fang, 2nd Xiaoqing Sun, 3rd Jiahai Yang
Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing, China
Beijing National Research Center for Information Science and Technology
{fx18, sxq16}@mails.tsinghua.edu.cn, [email protected]

4th Xinran Liu
National Computer Network Emergency Response Technical Team / Coordination Center, Beijing, China
[email protected]
Abstract—DGA-based botnets, which use Domain Generation Algorithms (DGAs) to evade supervision, have become one of the most destructive threats to network security. Over the past decades, a wealth of defense mechanisms focusing on domain features have emerged to address the problem. Nonetheless, DGA detection remains a daunting and challenging task due to the big data nature of Internet traffic and the fact that linguistic features extracted only from domain names are insufficient, since adversaries can easily forge them to disturb detection. In this paper, we propose a novel DGA detection system which employs an incremental word-embeddings method to capture the interactions between end hosts and domains, characterize time-series patterns of DNS queries for each IP address, and thereby explore temporal similarities between domains. We carefully modify the Word2Vec algorithm and leverage it to automatically learn dynamic and discriminative feature representations for over 1.9 million domains, and develop a simple classifier for distinguishing malicious domains from benign ones. Given the ability to identify temporal patterns of domains and update models incrementally, the proposed scheme makes progress toward adapting to the changing and evolving strategies of DGA domains. Our system is evaluated and compared with the state-of-the-art system FANCI and two deep-learning methods, CNN and LSTM, on data from a large university's network named TUNET. The results suggest that our system outperforms these strong competitors by a large margin on multiple metrics and meanwhile achieves a remarkable speed-up in model updating.

Index Terms—Domain-Embeddings, DGA Detection, Word2vec, Incremental Training
I. Introduction
DGAs are commonly used by botnets to bypass security mechanisms that employ static methods such as blacklists. They can generate a vast number of pseudo-random domain names, of which the attacker selects only a small subset for registration to establish command and control (C&C) connections [1] [2]. This results in an asymmetric situation: attackers can use any one of the generated domains to control bots, but defenders must monitor all of them. A wide spectrum of methods for DGA detection has been proposed in recent years, but most of them rely on hand-crafted linguistic features and do not work well when the botmaster decides to change the domain-generating strategy. This explains why some character-based detectors that work well for traditional DGAs perform poorly when confronted with plausibly clean-looking domain names based on wordlists (also called dictionaries) [3].

In such a scenario, we are concerned with developing an algorithm that is resilient to feature change and functions well not only for character-based or wordlist-based DGAs, but also for completely new algorithms never seen before. Our intuition is that bots tend to exhibit similar behavior patterns no matter what kind of DGA algorithm is implemented. These time-relevant patterns provide more robust and stable features, which improve flexibility against changing and evolving attack strategies. For example, the bots controlled by the same entity communicate with the same C&C server, and botnet members generate a large amount of regular traffic when launching an attack. In addition, our system must adapt to the big data nature of Internet traffic and the explosive growth of malicious domains, so a well-designed incremental training strategy is indispensable to reduce the model iteration cost.

In this paper, we propose a novel DGA detection system aimed at a wide range of DGA families and the never-ending growth of Internet DNS traffic. The core of our idea is to characterize time-series patterns of DNS queries for each IP address, explore temporal similarities between domains, and apply an incremental training strategy to speed up model updating.

To sum up, the contributions of this work are threefold:

1) In order to improve flexibility against changing and evolving attack strategies, we focus on the underlying relevance among domains and utilize the latent patterns of DNS query sequences to detect DGAs. We also apply the word2vec algorithm for a mapping from DGA detection to vector arithmetic.

2) To cope with the never-ending growth of Internet DNS traffic, we utilize an incremental training strategy for the word2vec algorithm, which helps to speed up the model training process when additional training data is provided.

3) We built a practical system based on the proposed algorithm, and achieved excellent results in several empirical experiments and real-world deployments.

The rest of this paper is organized as follows. In Section 2, we introduce some background knowledge and systematically outline related works. In Section 3, we present our DGA detection system based on the incremental word2vec algorithm in detail. Then we provide our experimental methodology and results in Section 4. Finally, we summarize our primary work and discuss future directions in Section 5.

II. Related Work

A. DGA and DGA Detection
In order to detect DGA domains, Yadav et al. [4] proposed a technique based on the significant difference between traditional DGA domains and human-generated domains in terms of the distribution of alphanumeric characters. In addition, Antonakakis et al. [1], Schüppen et al. [5] and Wang et al. [6] proposed machine-learning based DGA detectors using human-engineered lexical features of DGA domain names, while Tong et al. [7], Lison et al. [8] and Tran et al. [9] came up with methods using deep learning algorithms such as CNN, LSTM, and BiLSTM. However, attackers have designed a more resilient class of malicious algorithmically generated domains (mAGDs), produced by randomly selecting and concatenating words from a dictionary in order to imitate legitimate domain names created by a human. This new kind of DGA is much harder to detect. In fact, many state-of-the-art DGA detectors which function well for traditional DGAs perform poorly when faced with wordlist-based ones.

Confronted with such a challenging situation, defenders have presented several countermeasures. Pereira et al. [3] first proposed a method for combating wordlist-based DGAs in 2018. They built a new structure named WordGraph based on the segmentation of domain names, and then employed it to further discover DGA dictionaries. Another approach, raised by J. Koh et al. [10], extracted in-depth semantic features from an unrelated corpus and used transfer learning to learn the semantic signatures of wordlist-based DGA families. These approaches perform well, but nevertheless focus only on wordlist-based DGAs.

B. Word Embeddings Algorithm
The basic idea of word embeddings was initially proposed by Hinton in 1986 [11], under the name distributed representation at that time. This method is mainly used in the area of Natural Language Processing (NLP), but we can still utilize it in the Domain Name System (DNS) analysis field by analogizing DNS query sequences to natural language sentences. This follows the core idea of W. Lopez et al. [12], which considers the DNS queries from a particular source IP address during a specific time interval as words in a single document. Nowadays the most ubiquitous word embeddings method is Word2Vec [13], and in this paper we use the Skip-Gram model with Negative Sampling (SGNS) [14], an advanced variant of Word2Vec, as the basic algorithm, owing to its popularity. Several previous works tried to apply the word2vec algorithm to the DNS-related field ([15], [16]), but their target is traffic classification instead of DGA detection.

Fig. 1. System architecture.
Although we show through experiments that SGNS can accurately classify mAGDs, existing neural word embeddings methods, including SGNS, are multi-pass algorithms and thus cannot perform incremental model updates: they have to re-train the model on the old and new training data from scratch whenever additional training data is provided [17]. To this end, some researchers have focused on exploring incremental training strategies for word embeddings methods ([17], [18]). As with the conventional word2vec algorithm, to the best of our knowledge little or no attention has been paid in the literature to the incremental word2vec method when it comes to DGA detection.

III. System Architecture

In this section, we describe our DGA detection system architecture and training mechanism. As shown in Fig. 1, our system has several components: Pre-processor, Detector, and Post-processor.
A. Pre-processor
Before the preprocessing phase, we deploy our data collectors on several core DNS servers in TUNET to collect raw DNS logs. Thereafter, we apply black/white lists from both public sources (e.g., malwaredomainlist.com) and private sources to the raw DNS corpus, which can be considered a pre-labeling process. To further calibrate the labeling results, we sample part of the raw data for manual labeling. The unlabeled leftovers are still used for training word embeddings, since the word2vec algorithm is unsupervised.

After the labeling process, we feed all of the data to a data wrangling and cleaning module, which functions as follows. First of all, the module traverses the entire dataset and removes all queries containing an invalid IP address, query type, or query name. Second, since many of the DNS queries are for nonexistent domain names, rarely duplicated, and in many cases composed of a large number of changing prefixes and a few unchanging suffixes, we merge similar domains by their common suffixes. Besides, to eliminate the impact of ccTLDs (country code Top-Level Domains [19]), we remove all ccTLDs from the tails of domain names containing them. Last but not least, we select an appropriate time window size and reorganize the data structure. More specifically, we determine a window size such as 10 minutes, which remains a hyper-parameter to be decided later, and partition the dataset accordingly. Query records in each window are organized in the format [timestamp, IP, domain, domain, ...]. Finally, all queries from a specific IP address during the pre-defined time window are grouped and hence constitute a Document, with each domain name as a Word.
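To make the grouping concrete, the following Python sketch shows one way to turn raw query records into such Documents. The record layout, the abbreviated ccTLD list, and the keep-last-two-labels suffix merge are illustrative assumptions, not the authors' exact implementation.

```python
from collections import defaultdict

WINDOW = 600                          # 10-minute window in seconds; a tunable hyper-parameter
CC_TLDS = {"cn", "uk", "de", "jp"}    # abbreviated illustrative list

def normalize(domain):
    """Drop a trailing ccTLD and merge changing prefixes into a common suffix."""
    labels = domain.rstrip(".").lower().split(".")
    if labels and labels[-1] in CC_TLDS:
        labels = labels[:-1]          # strip the ccTLD
    return ".".join(labels[-2:])      # keep only the last two labels (assumed merge rule)

def build_documents(records):
    """records: iterable of (timestamp, src_ip, qname) tuples."""
    docs = defaultdict(list)          # (window_id, ip) -> list of Words
    for ts, ip, qname in records:
        docs[(int(ts) // WINDOW, ip)].append(normalize(qname))
    return list(docs.values())        # each value is one Document

# toy usage
print(build_documents([(0, "1.2.3.4", "a1.evil.example.cn"),
                       (30, "1.2.3.4", "mail.tsinghua.edu.cn")]))
```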
B. Detector
Our detector consists of two parts: the incremental word2vec model and a subsequent simple classifier.
1) Incremental Word2Vec Model:
Based on previous research results ([17], [18]), we apply the incremental training method of SGNS to the domain-embeddings generation model in this paper. Given one document output by the Pre-processor, we assume that the words (domains) inside constitute a sequence w_1, w_2, ..., w_n. The classical SGNS model then attempts to minimize the following objective function to learn domain embeddings:

L_{SGNS} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{|j| \le c,\, j \ne 0} \Big[ \log \sigma\big(t_{w_i} \cdot c_{w_{i+j}}\big) + k\, \mathbb{E}_{v \sim q(v)}\big[\log \sigma(-t_{w_i} \cdot c_v)\big] \Big]   (1)

where t_{w_i} is the target word w_i's embedding, c_{w_{i+j}} is the context word w_{i+j}'s embedding within a window of size c, σ(x) is the sigmoid function, k is a pre-fixed integer, and v is a negative sample drawn from q(v), the negative sampling distribution [18]. While Equation (1) can be optimized by Stochastic Gradient Descent (SGD) using AdaGrad [20] in an online fashion, traditional multi-pass SGNS training still needs to scan through the entire dataset first to pre-compute the negative sampling distribution q(v), which makes it difficult to perform efficient incremental model updates when additional training data arrives, especially when the amount of new data is small compared to the old.
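As a sanity check on Equation (1), here is a toy NumPy sketch of the loss contribution of a single (target, context) pair, with the expectation over q(v) replaced by a sum over k sampled negatives; the random vectors stand in for trained embeddings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(t_w, c_ctx, c_negs):
    """One inner term of Equation (1): positive pair plus k negative samples."""
    pos = np.log(sigmoid(t_w @ c_ctx))                         # log σ(t_wi · c_wi+j)
    neg = sum(np.log(sigmoid(-t_w @ c_v)) for c_v in c_negs)   # log σ(−t_wi · c_v)
    return -(pos + neg)

rng = np.random.default_rng(0)
t_w, c_ctx = rng.normal(size=100), rng.normal(size=100)
c_negs = rng.normal(size=(5, 100))                             # k = 5 negatives
print(sgns_pair_loss(t_w, c_ctx, c_negs))
```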
Because new domains keep showing up continuously in the real world, we need an incremental extension of SGNS. We adopt the methodology of previous works ([17], [18]), which goes through the training data in a single pass to update word embeddings incrementally. Algorithm 1 presents this incremental SGNS algorithm.

Algorithm 1 Incremental SGNS
for each new batch D of training data do
    f(d) ← 0 for each previously unseen d ∈ D
    n ← length(D)
    for i ← 1, ..., n do
        f(d_i) ← f(d_i) + 1
        q(d) ← f(d)^α / Σ_{d′∈D} f(d′)^α for all d ∈ D
        for j ← −c, ..., −1, 1, ..., c do
            draw k negative samples from q(d)
            use adaptive SGD to update t_{w_i}, c_{w_{i+j}}, and c_{v_1}, ..., c_{v_k}
        end for
    end for
end for

Algorithm 2 Draw Negative Samples
set array r with length K to empty
n ← length(W)
cnt ← 0
for i ← 1, ..., n do
    cnt ← cnt + 1
    if i ≤ K then
        r_i ← w_i
    else
        draw an integer k uniformly from 1, 2, ..., cnt
        if k ≤ K then
            r_k ← w_i
        end if
    end if
end for

In the implementation of incremental SGNS, how to efficiently produce negative samples is an important issue, since the efficiency of sampling greatly affects the overall training speed. To solve this problem, we utilize the Reservoir Sampling [21] algorithm, which helps to generate a single negative sample in only O(1) time (see Algorithm 2).
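A possible Python rendering of Algorithm 2's reservoir sampler is given below. Note that a plain reservoir tracks the raw unigram frequencies seen so far; reproducing the α-smoothed q(v) of Equation (1) requires the weighted variant discussed in [17], so this is a simplified sketch.

```python
import random

class NegativeSampler:
    """Reservoir of K tokens whose composition follows the word
    frequencies seen so far, so drawing one negative costs O(1)."""
    def __init__(self, K=100_000):
        self.K = K
        self.reservoir = []
        self.cnt = 0                        # words seen so far (cnt in Algorithm 2)

    def update(self, word):
        self.cnt += 1
        if len(self.reservoir) < self.K:
            self.reservoir.append(word)     # fill phase: r_i <- w_i
        else:
            j = random.randrange(self.cnt)  # uniform in [0, cnt)
            if j < self.K:
                self.reservoir[j] = word    # replace phase: r_k <- w_i

    def sample(self):
        return random.choice(self.reservoir)  # one negative sample in O(1)

# toy usage
sampler = NegativeSampler(K=3)
for w in ["a.com", "b.net", "a.com", "c.org", "d.io"]:
    sampler.update(w)
print(sampler.sample())
```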
2) Logistic Regression Classifier:
Without loss of generality, we use a Logistic Regression classifier as the tail classifier. It receives the word-embeddings produced by the word2vec model together with the corresponding ground-truth labels, which specify whether each domain is malicious or not. It is noteworthy that logistic regression naturally supports incremental training, since its parameters can be updated by SGD every time new training data is provided. In the testing/evaluation/deployment phase, the classifier directly scores the input domain-embeddings without extra operations.
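A minimal scikit-learn sketch of such an SGD-trained logistic regression is shown below; the random arrays stand in for domain embeddings and labels, and loss="log_loss" is the spelling in recent scikit-learn versions (older releases use "log").

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_init, y_init = rng.normal(size=(1000, 100)), rng.integers(0, 2, 1000)  # stand-in data
X_new,  y_new  = rng.normal(size=(200, 100)),  rng.integers(0, 2, 200)

clf = SGDClassifier(loss="log_loss", alpha=1e-4)  # logistic regression trained by SGD
clf.partial_fit(X_init, y_init, classes=[0, 1])   # initial labeled batch
clf.partial_fit(X_new, y_new)                     # incremental update on new labels

scores = clf.predict_proba(X_new)[:, 1]           # per-domain maliciousness score
```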
3) Workflow Description:
Before the Detector, we have already divided the datasets into a labeled part (relatively small) and an unlabeled part (relatively large), owing to the expensive cost of manual labeling. Fortunately, we do not need ground truth when training unsupervised word2vec models. Hence, our overall training strategy is to use all received valid data for training the word2vec model, while feeding only the labeled data to the classifier, making the best use of the collected data. Based on such a greedy strategy, we guarantee the quality and generalization ability of the obtained domain-embeddings, which plays a vital role in improving the performance of the Logistic Regression classifier trained with a relatively small amount of labeled data.
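One way to realize this workflow with off-the-shelf tooling is sketched below using gensim. Note that gensim's vocabulary-update path re-estimates the noise distribution on each update and is therefore only an approximation of the single-pass incremental SGNS of Algorithm 1; the toy documents are placeholders for Pre-processor output.

```python
from gensim.models import Word2Vec

# stand-ins for Documents emitted by the Pre-processor
initial_docs = [["example.com", "cdn.net", "mail.org"],
                ["example.com", "update.io"]]
new_docs = [["kjhxzq.biz", "example.com"]]

# sg=1, negative=5 selects skip-gram with negative sampling (SGNS)
model = Word2Vec(initial_docs, vector_size=100, window=5,
                 sg=1, negative=5, min_count=1)

model.build_vocab(new_docs, update=True)   # grow the vocabulary in place
model.train(new_docs, total_examples=len(new_docs), epochs=model.epochs)

emb = model.wv["example.com"]              # embedding fed to the classifier
```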
C. Post-processor
The classification results from the Detector can be used in two ways. First, we can assume that a domain name whose score is above a pre-set threshold is, with high probability, a DGA-generated domain name. With this assumption, we can construct a feedback loop to update the blacklists and whitelists used in the pre-processing stage. Second, we can use the results on the test dataset to evaluate the detector's performance and analyze hard cases, which is extremely meaningful for estimating the trends of current and incoming DGAs and further improving the performance of the DGA detection system.
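The feedback loop can be as simple as the following sketch; the threshold value and the symmetric whitelist rule are illustrative assumptions rather than the paper's chosen settings.

```python
THRESHOLD = 0.9   # illustrative value; the paper leaves this as a tunable setting

def update_lists(domains, scores, blacklist, whitelist):
    """Feed high-confidence detections back into the pre-processing lists."""
    for d, s in zip(domains, scores):
        if s >= THRESHOLD:
            blacklist.add(d)        # confident DGA detection
        elif s <= 1 - THRESHOLD:
            whitelist.add(d)        # confident benign verdict

# toy usage
blacklist, whitelist = set(), set()
update_lists(["kjhxzq.biz", "tsinghua.edu"], [0.97, 0.02], blacklist, whitelist)
print(blacklist, whitelist)
```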
IV. Datasets

We collected DNS data for two consecutive weeks from the Tsinghua campus network using Passive DNS [22] tools. A total of 162 million raw DNS query logs were obtained. The lengthy period of data recording guarantees a representative dataset which covers different times of the day, different days of the week, and both working and non-working days. More information about the datasets is shown in Table I.

Some steps had to be completed before the experiments. The critical points are first filtering the data with black/white lists and then manual labeling. Due to the huge amount of collected raw data, we cannot afford to label it all. Thus we sample and label the first 15% of the total 162 million queries and split this labeled dataset into two parts: 80% as trainset-with-gt (i.e., training set with ground truth) and the remaining 20% as testset-with-gt (i.e., test set with ground truth), which are employed to conduct the comparison experiments between our method and existing methods.

The reason why we choose the first 15% of the dataset for labeling is that, during the time this part of the data was collected, we coincidentally found a large number of domains from the query logs appearing on the malicious domain lists of public blacklists such as DGArchive [23]. Therefore, we started to collect data from that point in time and took these DNS logs containing DGA domains as the ground-truth datasets to be labeled.

Meanwhile, the last 85% of the unlabeled logs are used as the validation data for the incremental word2vec algorithm, because we intend to demonstrate that incremental word2vec functions well not only on the initial labeled datasets but also on newly added data. We randomly sample them at a scale of 1/10 in a consecutive way, then filter and manually label this new dataset. Again, the first 80% and last 20% of the dataset are put into trainset-with-gt and testset-with-gt respectively.
TABLE I: Details of TUNET DNS Datasets

Properties                         Descriptions
Duration of Data Collection        A total of 14 consecutive days
Generation Rate of DNS Queries     About 500 thousand/h
Peak Rate of DNS Queries           About 3 million/h
Occupied Space                     About 7 GB/day
Total Amount of DNS Queries        About 162 million in total
Amount of Unique Domains Queried   About 1.9 million in total

TABLE II: Analysis Results of Datasets with Ground Truth

Properties                              Values
All DNS Domains      Total Amount       38,235,023
                     Unique Amount      463,030
Benign DNS Domains   Total Amount       35,833,755
                     Unique Amount      368,793
DGA DNS Domains      Total Amount       2,401,268
                     Unique Amount      94,237
                     Character-Based    84,538 (Unique)
                     Wordlist-Based     9,699 (Unique)
V. Experiments
This section evaluates the performance of the proposed scheme on a real-world network. All operations were performed on a terminal server with an Intel i7 CPU and 32GB RAM, running Ubuntu Linux 16.04.
A. Visualization
To be intuitive, we use t-SNE [24] to visualize the domain-embeddings generated by the incremental word2vec model. As shown in Fig. 2, DGA domains belonging to different families are labeled as separate classes and drawn in different colors, while benign domains are labeled as class 0 and densely clustered on the right of the picture. Among the classes, the class drawn in cyan refers to wordlist-based DGA domains, and the other classes except class 0 represent character-based DGA domains.

Fig. 2 demonstrates that the clusters of malicious and benign domains can be divided neatly and without difficulty with the help of incremental word2vec. This suggests that wordlist-based DGA domain names, which always mislead traditional methods, can be easily identified using the incremental word2vec algorithm.
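A visualization of this kind can be produced with scikit-learn's t-SNE as sketched below; the random matrix and labels stand in for the learned domain-embeddings and family assignments.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))        # stand-in for learned domain-embeddings
labels = rng.integers(0, 5, 500)       # 0 = benign, >0 = DGA family (illustrative)

X2 = TSNE(n_components=2, perplexity=30).fit_transform(X)
plt.scatter(X2[:, 0], X2[:, 1], c=labels, cmap="tab10", s=4)
plt.show()
```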
B. Comparison Experiment

To further evaluate our system's performance, we conduct comparison experiments between IWM (short for Incremental Word2Vec Model), FANCI (representative of traditional machine learning methods), and CNN and LSTM (representatives of character-based deep learning methods).

It is notable that currently popular DGA detection methods such as FANCI [5] and D3N [7] usually conduct experiments solely on NXDomains, which misses many DGA domains. In fact, the number of DGA domains in NXDomains accounts for less than 36% of the total in all DNS data. Moreover, we find that the FANCI method, using Random Forests trained on NXDomains, does not perform well on our testset containing only NXDomains, especially for wordlist-based malicious domains, which are almost undetectable in this case. Furthermore, if we evaluate the model on testsets containing domains beyond the NXDomains, it can barely function normally.

Considering the situation above, we design the following sub-experiments. The results are published in Table III and Table IV.

Fig. 2. Visualization of Domain-Embeddings with All DGA Families Using t-SNE

TABLE III: Results of Exp.1 to Exp.5 with All DGA Families

Methods   Trainset   Testset   PRE     TPR     FPR      F1-score
FANCI     NXD        NXD       0.791   0.932   0.117    0.856
FANCI     NXD        all       0.001   0.176   0.603    0.001
FANCI     all        all       0.967   0.717   0.012    0.823
CNN       all        all       0.962   0.906   0.0004   0.934
LSTM      all        all       0.949   0.962   0.0003   0.947
IWM       all        all
Exp.1. RF (short for Random Forests) of FANCI, trained on NXDomains extracted from trainset-with-gt, evaluated on NXDomains extracted from testset-with-gt.

Exp.2. RF of FANCI, trained on NXDomains extracted from trainset-with-gt, evaluated on testset-with-gt.

Exp.3. RF of FANCI, trained on trainset-with-gt, evaluated on testset-with-gt.

Exp.4. CNN and LSTM, trained on trainset-with-gt, evaluated on testset-with-gt.

Exp.5. IWM, trained on trainset-with-gt, evaluated on testset-with-gt.

TABLE IV: Results of Exp.1 to Exp.5 with Only Wordlist-Based DGAs

Methods   Trainset   Testset   PRE     TPR     FPR     F1-score
FANCI     NXD        NXD       0.014   0.177   0.118   0.026
FANCI     NXD        all       0.001   0.002   0.987   0.001
FANCI     all        all       0.096   0.133   0.012   0.112
CNN       all        all       0.239   0.475   0.014   0.318
LSTM      all        all       0.345   0.489   0.010   0.403
IWM       all        all

To be clear, we describe the training strategy for IWM here. For the labeled datasets, we take the first half of trainset-with-gt as the initial train set and the second half as the new data continuously collected in the real world, named incremental-trainset. The same operations are conducted for the unlabeled datasets used to train word embeddings. For convenience, we divide the incremental train set (both labeled and unlabeled) into ten pieces. During the experiment, we first train the initial model on the initial train set and then conduct a model updating operation for each newly added piece.

The performance of the models is evaluated with the following metrics:
Precision (PRE), True Positive Rate (TPR, also called Recall), False Positive Rate (FPR), and F1-score.

From Exp.1, 2, and 5 we can conclude that the performance of FANCI can hardly meet expectations. To validate that our method is superior to traditional machine learning and deep learning algorithms based solely on domain name strings, given the same train and test sets, we conducted Exp.3, 4, and 5. We can see that IWM, whether tested on all DGA families or only on the wordlist-based family, performs clearly better than FANCI, CNN, and LSTM.

What is more, when confronted with wordlist-based DGA domain names, IWM trumps the other detectors with 100% recall and 99.6% precision, while the best of the others can hardly achieve half of these values.
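For completeness, the four metrics reduce to simple confusion-matrix arithmetic; the sketch below also shows how the reported wordlist-based precision and recall imply an F1-score of about 0.998.

```python
def metrics(tp, fp, tn, fn):
    pre = tp / (tp + fp)              # Precision
    tpr = tp / (tp + fn)              # True Positive Rate (Recall)
    fpr = fp / (fp + tn)              # False Positive Rate
    f1 = 2 * pre * tpr / (pre + tpr)  # F1-score
    return pre, tpr, fpr, f1

# F1 implied by the reported wordlist-based result (PRE 0.996, TPR 1.0)
pre, tpr = 0.996, 1.0
print(2 * pre * tpr / (pre + tpr))    # ~0.998
```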
Fig. 3. Training Time of Basic and Incremental Word2Vec Methods When New Data is Provided

TABLE V: Comparison Results of Basic and Incremental Word2Vec Algorithms

Methods   PRE   TPR   FPR   F1-score   Training Time
Basic
Incre.

C. Evaluation of Incremental Methodology
This evaluation experiment is designed to show that our polished version of the basic word2vec algorithm, i.e., incremental word2vec, can perform as well as or even better than its predecessor while gaining a tremendous acceleration in model updating. We construct two control groups: one uses the standard word2vec method, and the other the incremental word2vec method. The training data is processed in the same way as in Exp.5 of Section B, and both groups use testset-with-gt as the evaluation dataset. The number of training epochs for both groups is 200, and the evaluation results are listed in Table V and Fig. 3.

VI. Conclusion and Future Work

In this paper, we presented a novel system using an incremental word2vec algorithm, which leverages inter-domain relationships to detect DGA domains effectively and scalably. Our system performs excellently when confronted with various DGA families, even with wordlist-based DGAs, which are almost invincible against traditional detectors.

Moreover, to make model updating faster when new data is continuously provided, we explore an incremental training strategy. In our empirical experiments, we demonstrate that our incremental word2vec method not only outperforms other detectors but also gains a tremendous acceleration in model re-training.

Since the datasets for training and evaluation are collected continuously from real-world networks, our system works as an online system which deals with tens of thousands of DNS query streams with high accuracy and efficiency.

The limitation of this paper is that the vocabulary of the incremental word2vec model can become very large when unlimited data pours in, even though we already take measures such as merging common suffixes of domains to shorten the vocabulary. The problem is difficult because we cannot merely cap the size of the vocabulary: the absence of some domains' embeddings may leave the classifier unable to find a suitable vector representation. Besides, labeled datasets of domains are hard to obtain. In future work, we will pursue solutions to these problems and build a better detection model.

Acknowledgment

We thank Mingkai Tong and other classmates and teachers for their valuable help. Additionally, we thank the Information Technology Center of Tsinghua University for authorizing the use of their data in our experiments. This work is supported by the National Science and Technology Major Project under Grant No. 2017YFB0803004.

References

[1] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh, W. Lee, and D. Dagon, "From throw-away traffic to bots: Detecting the rise of DGA-based malware," in 21st USENIX Security Symposium (USENIX Security 12). Bellevue, WA: USENIX Association, 2012, pp. 491–506. [Online]. Available: https://www.usenix.org/conference/usenixsecurity12/technical-sessions/presentation/antonakakis

[2] D. Plohmann, K. Yakdan, M. Klatt, J. Bader, and E. Gerhards-Padilla, "A comprehensive measurement study of domain generating malware," in 25th USENIX Security Symposium (USENIX Security 16). Austin, TX: USENIX Association, Aug. 2016, pp. 263–278. [Online]. Available: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/plohmann

[3] M. Pereira, S. Coleman, B. Yu, M. DeCock, and A. Nascimento, "Dictionary extraction and detection of algorithmically generated domain names in passive DNS traffic," in 21st International Symposium on Research in Attacks, Intrusions, and Defenses (RAID 2018), Heraklion, Crete, Greece, Sep. 2018, pp. 295–314.

[4] S. Yadav, A. K. K. Reddy, A. L. N. Reddy, and S. Ranjan, "Detecting algorithmically generated malicious domain names," in ACM SIGCOMM Conference on Internet Measurement, 2010.

[5] S. Schüppen, D. Teubert, P. Herrmann, and U. Meyer, "FANCI: Feature-based automated NXDomain classification and intelligence," in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1165–1181.

[6] Z. Wang, Z. Jia, and B. Zhang, "A detection scheme for DGA domain names based on SVM," in Proceedings of MMSA 2018. Atlantis Press, Mar. 2018. [Online]. Available: https://doi.org/10.2991/mmsa-18.2018.58

[7] M. Tong, X. Sun, J. Yang, H. Zhang, S. Zhu, X. Liu, and H. Liu, "D3N: DGA detection with deep-learning through NXDomain," Aug. 2019, pp. 464–471.

[8] P. Lison and V. Mavroeidis, "Automatic detection of malware-generated domains with recurrent neural models," CoRR, 2017.

[9] D. Tran, H. Mac, V. Tong, H. A. Tran, and L. G. Nguyen, "A LSTM based framework for handling multiclass imbalance in DGA botnet detection," Neurocomputing, vol. 275, Nov. 2017.

[10] J. J. Koh and B. Rhodes, "Inline detection of domain generation algorithms with context-sensitive word embeddings," CoRR, 2018.

[11] G. E. Hinton et al., "Learning distributed representations of concepts," in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, vol. 1, Amherst, MA, 1986, p. 12.

[12] W. Lopez, J. Merlino, and P. Rodríguez-Bocca, "Vector representation of internet domain names using a word embedding technique," Sep. 2017, pp. 1–8.

[13] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013. [Online]. Available: https://arxiv.org/abs/1301.3781

[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013.

[15] (entry illegible in the source).

[16] (entry illegible in the source), pp. 232–236, 2018.

[17] N. Kaji and H. Kobayashi, "Incremental skip-gram model with negative sampling," CoRR, 2017.

[18] (entry illegible in the source).

[19] Wikipedia, "Country code top-level domain," https://en.wikipedia.org/wiki/Country_code_top-level_domain, 2019.

[20] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121–2159, Jul. 2011.

[21] J. S. Vitter, "Random sampling with a reservoir," ACM Transactions on Mathematical Software, vol. 11, no. 1, pp. 37–57, 1985.