Universal Anomaly Detection: Algorithms and Applications
Shachar Siboni and Asaf Cohen,
Member, IEEE
Abstract—Modern computer threats are far more complicated than those seen in the past. They are constantly evolving, altering their appearance, perpetually changing disguise. Under such circumstances, detecting known threats, a fortiori zero-day attacks, requires new tools, which are able to capture the essence of their behavior, rather than some fixed signatures.

In this work, we propose novel universal anomaly detection algorithms, which are able to learn the normal behavior of systems and alert for abnormalities, without any prior knowledge of the system model, nor any knowledge of the characteristics of the attack. The suggested method utilizes the Lempel-Ziv universal compression algorithm in order to optimally give probability assignments for normal behavior (during learning), then estimates the likelihood of new data (during operation) and classifies it accordingly.

The suggested technique is generic, and can be applied to different scenarios. Indeed, we apply it to key problems in computer security. The first is detecting Botnet Command and Control (C&C) channels. A Botnet is a logical network of compromised machines which are remotely controlled by an attacker using a C&C infrastructure, in order to perform malicious activities. We derive a detection algorithm based on timing data, which can be collected without deep inspection, from open as well as encrypted flows. We evaluate the algorithm on real-world network traces, showing how a universal, low-complexity C&C identification system can be built, with high detection rates and low false-alarm probabilities. Further applications include malicious tool detection via system-call monitoring and data leakage identification.
Index Terms—Computer Security; Anomaly Detection; Universal Compression; Probability Assignment; Individual Sequences; Botnets; Command and Control Channels; Malicious Tools; Data Leakage.
Parts of this work appeared at the Workshop on Information Forensics and Security, WIFS 2014, Atlanta, GA. S. Siboni and A. Cohen are with the Department of Communication Systems Engineering, Ben-Gurion University, Beer-Sheva, 84105, Israel. E-mails: [email protected]; [email protected]. Work supported by the Israeli Chief Scientist under the Kabarnit consortium.

I. INTRODUCTION

CYBER-ATTACKS are a disturbing security threat existing today in communication- and computer-based systems. They affect a wide range of domains, including electricity and water infrastructures, financial and capital markets, medicine and healthcare, the army, businesses, enterprises and universities around the world. The majority of massive cyber-attacks today are conducted by Botnets, including Distributed Denial-of-Service (DDoS) attacks, spamming, fraud and identity theft, etc.

A Botnet is a logical network of compromised machines, Bots, which are remotely controlled by a Botmaster using a Command and Control (C&C) infrastructure. The compromised machines can be any collection of vulnerable hosts, e.g., computers, mobile phones or tablets. Infection is via infected websites, file-sharing networks, email attachments, and more (see infection tree analysis in [1]). Once a host is infected and becomes a Bot, it is programmed to use a C&C channel for further downloads and updates, and awaits instructions from the Botmaster. It updates its data and operates upon receiving commands from the Botmaster (e.g., launch a DDoS attack).

The C&C channel plays a key role in a Botnet by operating as the communication means within the network. The Botmaster manages and controls its Bots using these C&C channels in order to perform malicious activities on selected targets. This way, the Bots act as a distributed attack platform on demand, coordinated by the Botmaster. However, since the C&C channels are the only way the Botmaster can communicate with its Bots, they can be considered the weakest link of a Botnet, as blocking them renders the Botnet useless. Accordingly, a main objective is to identify and block C&C activities before any real harm is caused.

In order to mask their activities and bypass defense mechanisms such as firewalls, Botnets use common communication protocols as their C&C, including IRC [2], [3], HTTP [3], Peer-to-Peer (P2P) [4], [5], [6] and DNS [7].
Recently, Botnets have also adopted social networks as the underlying C&C [8]. However, while it is tempting to develop protocol-specific methods to detect Botnets, attackers constantly improve their C&C infrastructures and develop new evasion capabilities, including changing signatures of the C&C traffic, employing encryption and obfuscation, and using domain generation [9] in order to deceive detection systems.

Current techniques for Botnet study and detection are based on honeynets, signature-based detection and anomaly detection models [10]. Honeynets act as traps in order to collect information about Bots and study their behavior [11]. Once the mechanism of the monitored Bots is exposed, it is possible to design a designated detection and blocking mechanism. Signature-based approaches rely on a signature database of notorious Botnets that were previously learned. However, signature-based techniques are prone to zero-day attacks and require a constant update of the signature database [10]. Anomaly-based detection techniques, on the other hand, aim to detect anomalies in network traffic or system behaviour, which may indicate the presence of malicious activities.

A basic assumption when using anomaly detection is that attacks differ from normal behavior. Thus, traffic analysis is used on both packet and flow levels, considering metrics such as rate, volume, latency, response time and timestamps in order to identify anomalous data. Indeed, anomaly detection seems a promising approach for Botnet detection, since it may detect new structures of attacks (zero-day attacks). However, this may come at the cost of high false-alarm rates. Moreover, to achieve good performance, one may require prior knowledge, e.g., statistical assumptions on the normal data, such as a Markov Model [5] or ARMA modeling [12].

Gu et al.
proposed two anomaly-based detection systems, BotSniffer [3] and BotMiner [13], based on traffic analysis. The former was designed for IRC- and HTTP-based Botnets, while the latter was designed as protocol-independent detection, which requires no prior knowledge. However, both systems rely on Deep Packet Inspection techniques, hence are less suitable for on-the-fly analysis of large amounts of traffic. AsSadhan et al. [14] suggested that periodic behavior indicates Botnet activity. Tegeler et al. [15] presented a detection system, BotFinder, considering high-level statistical features of C&C communication which were extracted from known Botnets in a controlled environment, limiting the ability to detect new types of attacks. Protocol-specific systems were given by Villamarín-Salomón and Brustoloni [7] for DNS traffic, Chang and Daniels [4] for P2P Botnet topology, and Strayer et al. [2] using correlation of packet sizes and timing patterns for IRC-based Botnets.

From a classification point of view, Lu et al. [16] presented a classification scheme, BotCop, using a decision-tree statistical model to classify different types of applications. Yet, BotCop applied payload signature techniques. Este et al. [17] and Mazzariello and Sansone [18] employed a Support Vector Machine as a single-class classifier model, which constructs a statistical model based on a given training set in order to distinguish between normal and malicious activities.

The suggested system does not rely on memoryless features of the data, such as specific values or signatures. In contrast, it builds a context tree for the learned data; hence, when a new data sequence is tested, the order of values or events in it has the main impact on the classification performance.

A. Our Contribution
We study the problem of detecting Botnet C&C channels. We suggest a novel universal anomaly detection algorithm, which uses no a-priori information about either the Botnet traffic patterns or the normal behavior patterns, yet efficiently learns the normal behavior in order to generate a statistical model to which tested traffic can be compared.

Our classification model is based on the celebrated Lempel-Ziv algorithm, which is known as an optimal universal compression algorithm and hence a preferable universal prediction algorithm when applied to stationary and ergodic sources over finite alphabets. Using the probability assignment induced by the prediction algorithm, we rigorously define the statistical model which represents the normal behavior, and offer a mechanism to test new, unknown sequences using this model. Furthermore, we offer a new look at the way data is used in the classification process, offering the context of the data sequence as the key characteristic used in the classification.

We evaluate the model with real-world network traces, where timing data is the main tested feature. This allows us to be both protocol-independent and encryption-independent. Moreover, it allows us to suggest a system which is immune to various hiding techniques, especially when used with low-level features of the data, such as timings or sizes. The results clearly show that the suggested model is a favorable solution for the problem at hand, with excellent results in terms of low false alarm and high detection rates, yet with only moderate (linear time) complexity and no deep packet inspection. Finally, we note that the suggested scheme is applicable to any sequence of behaviors, and not necessarily only timing data.
Hence, one can use it to test for anomalous application behavior, anomalous communication patterns within an organization and outside it, etc.

A short conference version of this work appeared in [19]. The current paper includes algorithms and results for two new applications, together with additional explanations and discussions.

The rest of the paper is organized as follows. Section II gives the required background material. Section III describes the key concept of universal anomaly detection, with the key application,
Botnet Identification, as the main example. Section IV gives the test results for this case, using real network traces. Section V gives two additional applications, together with results on real data. Section VI discusses the possible prevention strategies attackers can use against the suggested system, and proves its robustness by arguing that such strategies would require huge amounts of data and massive learning. Finally, Section VII concludes the paper.

II. PRELIMINARIES
Classification refers to the problem of labeling unknown (new) instances with the most appropriate class among a set of (known) predefined classes. When the underlying probability distributions for the classes $\{p_i\}_{i=1}^{M}$ are known, and we wish to decide which generated a given sequence $y$, a decision rule of the form $\hat{i} = \arg\max_{1 \le i \le M} p_i(y)$ is optimal in the sense of minimizing the probability of error. In unary-class classification, however, information is available only on one type of instances. The goal may be to either identify such instances, or, in the case of anomaly detection, identify instances which do not fit the ones learned from. Indeed, when only few, if any at all, anomalous instances exist to learn from, yet instances of normal behavior are available, one can build a behavioral model based on the normal instances and classify any instance deviating from that model as anomalous [20].

Thus, given the probability distribution of the normal data, $p(\cdot)$, the optimal decision rule in terms of maximizing the detection probability given a fixed false alarm probability (in the Neyman-Pearson sense) is to compare $p(y)$ to a threshold, and decide that $y$ is normal if $p(y)$ is above the threshold and anomalous otherwise. The threshold is determined according to the required false alarm probability. In practice, the underlying distribution which generates the normal sequences is, of course, unknown. A reasonable approach in this case is to estimate it using the previously observed sequences and use the resulting estimate $\hat{p}(\cdot)$. Note, however, that the estimation problem differs significantly if a statistical model (e.g., i.i.d. or Markovian of a certain order) is given, if the only knowledge is that the sequences are related to some stationary and ergodic source, or, in the "worst" case, the data consists of individual sequences, that is, deterministic sequences with no pre-defined statistical model.

In this paper, we suggest an anomaly detection technique for the most general case, where no underlying statistical model is given. To do this, we build on the relation between prediction of discrete sequences and lossless compression [21], in order to use universal compression algorithms and their associated probability assignment in the anomaly detection procedure.

A. Universal Probability Assignment
The Lempel-Ziv algorithm [22], LZ78, is a universal compression algorithm with vanishing redundancy. Consequently, it can also be used as an optimal universal prediction algorithm [21], using the appropriate probability assignment. The LZ78 algorithm is widely used in a variety of other applications. In the context of classification, it was also used in [23] for typist identification based on keyboard events, and in [24] for English text, music pieces and protein classification. For completeness, we briefly describe the compression method and the associated probability assignment algorithm.

The LZ78 algorithm is a dictionary-based compression method. For a given sequence of data symbols, a dictionary of phrases parsed from that sequence is constructed based on the incremental parsing process, as follows. At the beginning, the dictionary is empty. Then, during each step of the algorithm, the smallest prefix of consecutive data symbols not yet seen, i.e., which does not exist in the dictionary, is parsed and added to the dictionary. By that, each phrase is a unique phrase in the dictionary, which may extend a previously seen phrase by one symbol.

Given a sequence $s^n = (s_1 s_2 \ldots s_n)$, a parsed phrase, $P$, is the smallest prefix of consecutive data symbols that has not been seen yet. This can also be considered as the suffix concatenation of a symbol $s_i$ (from the sequence) with a previously seen phrase $P'$ (from the dictionary), i.e., $P = (P' s_i)$. A dictionary, $D$, is the collection of all distinct phrases parsed from a given data sequence $s^n$, i.e., $D = \{P_1, P_2, \ldots, P_i, \ldots, P_n\}$. For example, the sequence aabdbbacbbda is parsed as a|ab|d|b|ba|c|bb|da.

A common representation of the dictionary is a rooted tree, where each phrase in the dictionary is represented as a path from the root to an internal node in the tree, according to the set of symbols the phrase consists of. In addition, leaf-nodes are added as a suffix for each phrase in the tree.
A statistical model can be defined for a given data sequence during the construction of a phrase-tree [21], as described next.

At the beginning, an initial tree is constructed, including only a root node and k leaf-nodes as its children, where k is the size of the alphabet. Then, for each new phrase parsed from a sequence, the tree is traversed, starting from the root, following the set of symbols the phrase consists of, and ending at the appropriate leaf-node. Once a leaf-node is reached, the tree is extended at this point by adding all the symbols from the alphabet as immediate children of that leaf, making it an internal node. In order to define a statistical model, each node in the tree, except for the root node, maintains a node traversal counter, where each leaf-node's counter is set to 1 and each internal node's counter equals the sum of its immediate children's counters.

For a probability assignment, as all leaf-nodes' counters are set to 1, they are assumed uniformly distributed with probability 1/i, where i is the total number of leaf-nodes. Each internal node's probability is defined as the sum of its immediate children's probabilities, which also equals the ratio between its counter and the current i. For example, Figure 1 demonstrates the resulting statistical model for the sequence "aabdbbacbbda". Each node in the tree is represented by the 3-tuple {symbol, counter, probability}. In addition, the probability of an edge is defined by dividing the nodes' probabilities. Note that the probabilities of edges connected directly to the root are equal to the appropriate root-child's counter divided by the total number of leaf-nodes, i, at each step of the algorithm. The probability of a phrase $P_i \in D$ is calculated by multiplying the probabilities of the edges along the path defined by the symbols of $P_i$. Moreover, note that for each phrase $P_i$ there exists a specific node in the tree whose probability represents the probability of that phrase.
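As a concrete illustration, the phrase-tree construction and probability assignment described above can be sketched in a few lines of Python. This is a minimal sketch for exposition (class and method names are ours), not the implementation used in the paper:

```python
from fractions import Fraction

class LZ78Model:
    """Minimal sketch of the LZ78 phrase tree described above.
    A node is a dict of children; a leaf is an empty dict."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        # Initial tree: a root whose children are one leaf per symbol.
        self.root = {a: {} for a in self.alphabet}

    def _count(self, node):
        # A leaf's counter is 1; an internal node's counter is the
        # sum of its immediate children's counters.
        return 1 if not node else sum(self._count(c) for c in node.values())

    def train(self, seq):
        """Incremental parsing: walk from the root; on reaching a leaf,
        extend it with one child per alphabet symbol and restart."""
        node = self.root
        for s in seq:
            node = node[s]
            if not node:                  # leaf reached: phrase ends here
                for a in self.alphabet:
                    node[a] = {}
                node = self.root

    def prob(self, seq):
        """Sequential probability assignment: multiply edge probabilities
        (child counter over parent counter), returning to the root
        whenever a leaf is consumed."""
        p, node = Fraction(1), self.root
        for s in seq:
            p *= Fraction(self._count(node[s]), self._count(node))
            node = node[s]
            if not node:
                node = self.root
        return p

if __name__ == "__main__":
    m = LZ78Model("abcd")
    m.train("aabdbbacbbda")              # parsed as a|ab|d|b|ba|c|bb|da
    print(m.prob("ba"), m.prob("bdca"))  # prints: 1/7 1/784
```

Running the sketch on the paper's example sequence reproduces the phrase probabilities discussed next: the root children end up with counters 7, 10, 4 and 7 (i = 28), so P(ba) = 10/28 × 4/10 = 1/7.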
For instance, from the example shown in Figure 1, it can be seen that $P(ba) = \frac{10}{28} \times \frac{4}{10} = \frac{1}{7}$. Considering a sequence $S$, if during the traversal a leaf-node is reached before all the symbols of $S$ are consumed, then the traversal returns to the root and continues until all the symbols of that sequence are consumed [23]. For example, the probability of the sequence "bdca", given the same statistical model above, is defined by the traversal Root → b → d → Root → c → a and is calculated as the product of the traversal probabilities:

$P(bdca \mid M_{aabdbbacbbda}) = \frac{10}{28} \times \frac{1}{10} \times \frac{4}{28} \times \frac{1}{4} = \frac{1}{784}$.

This stems from the conditional probability $\hat{P}(s_{t+1} \mid s^t)$, where $s_{t+1}$ is the next symbol after the (sub-)sequence $s^t$, which is calculated as the ratio between the counter of the node reached after $s_{t+1}$ and the counter of the node reached after $s^t$. We consider $s^t$ as the context of $s_{t+1}$ at time $t+1$.

III. ANOMALY DETECTION VIA UNIVERSAL PROBABILITY ASSIGNMENT
We now describe the building blocks of the anomaly detection system. Throughout, the system is described in the context of detecting anomalies in network traffic. Thus, in our problem domain, the data instances are discrete sequences of network traces. However, as previously mentioned, the proposed system is generic, and can easily be adapted to detect anomalies in any discrete sequence of events, with the proper preprocessing. In fact, two additional applications of the algorithms below are given in Section V.
A. Preprocessing
Fig. 1. An LZ78 statistical model for the sequence "aabdbbacbbda".

A data sequence is defined as a series of events from a flow between a specific client and a specific host. The i-th Network Event, denoted by $e_{i,xy}$, is a data transaction between client x and host y, and is defined by the tuple $e_{i,xy} = (t_i, tt_i, csb_i, scb_i, x, y)$, where $t_i$ is the time event $e_{i,xy}$ occurred; $tt_i$ is the duration of event $e_{i,xy}$; and $csb_i$ and $scb_i$ are the total number of bytes sent by client x to host y and by host y to client x, respectively. A Network Flow, denoted by $f_{xy}$, is a series of network events between client x and host y, sorted by their time of occurrence, $t_i$. That is, $f_{xy} = \{e_{1,xy}, e_{2,xy}, \ldots, e_{n,xy}\}$.

For actual learning and testing, it is not required to use all features (fields) in the data. As shown in the experimental results, good detection capabilities can be achieved even when focusing on a single feature. For example, timing data can be characterized by the difference between two consecutive events of the same flow, denoted Time-Difference (TD) and defined by $TD_{i,xy} = e_{i+1,xy}(t_{i+1}) - e_{i,xy}(t_i)$. A different perspective is the total time the event took, denoted Time-Taken (TT) and defined by $TT_{i,xy} = e_{i,xy}(tt_i)$. Similarly, one can focus only on sizes, e.g., Client-Server-Bytes (CSB) and Server-Client-Bytes (SCB), and respectively define $CSB_{i,xy} = e_{i,xy}(csb_i)$ and $SCB_{i,xy} = e_{i,xy}(scb_i)$.

Consequently, a single-feature data sequence is a serialization of one of the above features; e.g., with respect to Time-Difference, a sequence/flow is defined as: $f_{xy,TD} = \{e_{2,xy}(t_2) - e_{1,xy}(t_1), e_{3,xy}(t_3) - e_{2,xy}(t_2), \ldots, e_{n,xy}(t_n) - e_{n-1,xy}(t_{n-1})\}$.

The above procedure may result in a sequence over a very large alphabet (as, for example, times are given with very high precision). To reduce the range of values, quantization is performed.
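The single-feature serialization just described, followed by quantization to a finite alphabet, might look as follows. This is a sketch; the record type and centroid values are illustrative assumptions, not the paper's code:

```python
from collections import namedtuple

# Field names follow the event tuple e_i = (t, tt, csb, scb, x, y);
# the concrete record type here is an assumption for illustration.
Event = namedtuple("Event", "t tt csb scb x y")

def td_sequence(flow):
    """Serialize a flow (events sorted by time) into its Time-Difference
    feature: TD_i = t_{i+1} - t_i."""
    return [nxt.t - cur.t for cur, nxt in zip(flow, flow[1:])]

def quantize(values, centroids):
    """Map each value to the index of its nearest centroid, yielding a
    sequence over a finite alphabet of size k = len(centroids)."""
    return [min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values]

if __name__ == "__main__":
    flow = [Event(0.0, 1, 803, 360, 486, 52),
            Event(2.1, 1, 578, 507, 486, 52),
            Event(2.3, 1, 577, 505, 486, 52)]
    td = td_sequence(flow)                 # approximately [2.1, 0.2]
    print(quantize(td, [0.1, 1.0, 5.0]))   # prints: [1, 0]
```

The resulting index sequence is the discrete, finite-alphabet input that the LZ78 model consumes.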
For k quantization levels, a set of k centroids, $\{c_1, c_2, c_3, \ldots, c_k\}$, is used. The centroids are extracted from the available data during the training phase. Clearly, the number of centroids and the method for extracting them may affect the overall results. However, as seen in the experiments on real data, this fine-tuning is easily done during training.

B. Learning
The LZ78-based classification model is divided into a learning phase and a testing phase, as illustrated in Figure 2.
Fig. 2. A classification model based on the LZ78 prediction algorithm.
In the learning phase, an LZ78 statistical model is built based on a given training set of discrete (quantized) sequences over a finite alphabet, $S = \{S_1, S_2, \ldots, S_n\}$, using the mechanism explained in Section II-A. Of course, training is done only on normal, benign traffic. In the testing phase, first, each testing sequence is separately quantized using the same quantization method and the same set of centroids $\{c_1, c_2, \ldots, c_k\}$ which were extracted in the learning phase. Then, the probability of each suspected sequence (testing sequence) $T_j$ from a given testing set $T = \{T_1, T_2, \ldots, T_m\}$ is estimated based on the statistical model constructed during the learning phase, and classified with respect to a predefined threshold $Tr$. Specifically, the probability of each testing sequence, $p(T_j)$, is estimated using sequential probability assignment given the LZ78 statistical model built in the learning phase. Testing sequences for which $\hat{p}(T_j)$ is greater than or equal to $Tr$ are classified as normal (as they "fit" the model), while a lower-than-threshold value is classified as anomalous.

Accordingly, the performance of the classifier is measured by the false alarm and hit detection ratios, also known as the false positive rate (FPR), or Type 1 error, and the true positive rate (TPR), respectively, and demonstrated by a ROC (Receiver Operating Characteristic) curve. The false alarm ratio reflects the number of negative instances incorrectly classified as positive, in proportion to the total number of negatives in the test, whereas the hit detection ratio measures the proportion between the number of positive instances correctly classified and the total number of positives in the test.
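The threshold decision and the two ratios just defined can be sketched as follows, with anomalous instances taken as the positive class (function names are ours, for illustration):

```python
def classify(p_hat, threshold):
    """A tested sequence is normal iff its assigned probability
    meets the threshold."""
    return "normal" if p_hat >= threshold else "anomalous"

def fpr_tpr(scores, is_anomaly, threshold):
    """False-positive and true-positive rates for one threshold.
    'Positive' means flagged anomalous, i.e. an assigned probability
    below the threshold; is_anomaly holds the ground-truth labels."""
    tp = sum(1 for p, a in zip(scores, is_anomaly) if a and p < threshold)
    fp = sum(1 for p, a in zip(scores, is_anomaly) if not a and p < threshold)
    pos = sum(is_anomaly)
    neg = len(is_anomaly) - pos
    return fp / neg, tp / pos
```

Sweeping the threshold over the range of observed scores and plotting the resulting (FPR, TPR) pairs traces the ROC curve described in the text.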
The ROC curves are generated with respect to a set of thresholds, in our case in the range $[\min_j \hat{p}(T_j), \max_j \hat{p}(T_j)]$, where each threshold defines a specific confusion matrix and results in a specific point on the graph.

Note that the above classification model can be updated, either by extending the existing statistical model or by rebuilding it with new (training) data sequences, at any given time. This can be applied once the overall performance drops below a given threshold for a configurable interval of time (and not for each momentary decrease).

IV. BOTNETS IDENTIFICATION
The data set used contains high-level real-world network traces provided by an Internet security company (in order to maintain the confidentiality of the company's customers, the name of the company is withheld). The data set consists of 3,714,238 client-server transactions, taken during a time window of approximately 3 hours on a specific day in 2009. The whole data set is available at https://dl.dropboxusercontent.com/u/13592090/Botnet2009.rar. Each client, denoted by 'cid', may connect with several hosts (web-servers), denoted by 'hid'. In each transaction, data is sent both by the client and the host. This defines communication pairs CID HID. Each transaction is labeled as either a legal transaction, denoted by 'good', for normal data traffic generated by the client, or an illegal transaction, denoted by 'hostile', that is, Bots. Labeling was done by the security company's experts based on well-known black-lists. Note that these labels are not used during the classification process. They are used only in the validation phase. Table I exemplifies the structure of the data set. Note that each transaction is represented by a single record in the data set, which consists of the following fields: 'time', referring to the time the transaction took place; 'time-taken', the total time the transaction took; 'cs-bytes' and 'sc-bytes', representing the total bytes sent by the client/server(host) to the server(host)/client during the transaction, respectively; 'mime-type', denoting the Internet content type of the transaction, such as plain text, image, html page, application, etc.; 'cat', the category of the transaction, 'good' or 'hostile'; and the 'hid' and 'cid' fields, referring to the host index (Internet site, web-server) and client index, respectively. Again, to protect the identity of the company's customers, these indices were assigned arbitrarily.
However, some malicious sites are identified by their domain name, e.g., 'hotsearchworld.com' or 'blitzkrieg88.bl.funpic.de'.

Processing of the data included serialization and feature extraction: first, the given data set is split into a set of flows based on CID HID connections, as illustrated in Table I with respect to flows 486 52 and 9 49 (marked in gray). A 'Flow' is a sequence of related transactions of the same communication pair CID HID, sorted by time and with the same label, either 'good' or 'hostile'. In total, there are 19164 flows labeled 'good' and only 65 'hostile' flows (0.338%). This indicates the imbalance of the data set, where most of the transactions are legal and only a small fraction is illegal. However, this is characteristic of real network traffic behavior.

TABLE I
EXAMPLE FOR THE DATABASE STRUCTURE. EACH RECORD IN THE DATA SET REPRESENTS A DATA TRANSACTION BETWEEN A SPECIFIC CLIENT AND A SPECIFIC HOST/SERVER.

time      time-taken  cs-bytes  sc-bytes  mime-type                        cat      hid    cid
05:52:37  40          803       360       image/gif                        good     49     9
05:52:37  74          734       277       text/html                        good     102    15
05:52:37  27          578       507       image/gif                        good     52     486
05:52:37  27          578       507       image/gif                        good     75526  486
05:52:37  25          655       4196      image/jpeg                       good     52     4
05:52:37  25          655       4196      image/jpeg                       good     75526  4
05:52:37  26          577       505       image/gif                        good     52     486
05:52:37  26          577       505       image/gif                        good     75526  486
05:52:37  31          624       960       image/gif                        good     52     6
05:52:37  31          624       960       image/gif                        good     75526  6
05:52:37  1           812       22672     application/octet-stream         good     52     6
05:52:37  1           812       22672     application/octet-stream         good     75526  6
05:52:37  30          707       4368      image/jpeg                       good     52     2
05:52:37  30          707       4368      image/jpeg                       good     75526  2
05:52:37  28          667       2639      image/jpeg                       good     52     4
05:52:37  28          667       2639      image/jpeg                       good     75526  4
05:52:37  180         434       1451      text/html;%20charset=iso-8859-1  hostile  3      6
05:52:37  34          710       4270      image/jpeg                       good     52     2
05:52:37  34          710       4270      image/jpeg                       good     75526  2
05:52:37  69          697       334       text/css                         good     49     9

Next, selected features are extracted from each transaction, e.g., Time-Difference, Time-Taken, Server-Client-Bytes and Client-Server-Bytes. After quantization, the resulting sequences are the discrete-time, finite-alphabet sequences on which learning and testing were performed.

Note that flows from a single client may consist of both 'good' flows, reflecting legitimate data traffic generated by the client itself, as well as 'hostile' flows, which are generated by a Bot installed on the client. In contrast, a server (host) has only flows with the same label; that is, if a host was labeled as a C&C infrastructure, then all its transactions are considered malicious. The 'Client' and 'Host' definitions represent two different perspectives on the data set. On the one hand, one can examine the events occurring in the network from the client point of view, and on the other, from the host point of view, as will be shown in the following results.
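The flow-splitting step just described can be sketched as follows; records are represented here as plain dicts with the Table I field names (an illustrative assumption), and keying by (cid, hid, cat) mirrors the per-label flow definition:

```python
from collections import defaultdict

def split_into_flows(records):
    """Group transactions into flows keyed by (cid, hid, cat) and sort
    each flow by time, matching the 'Flow' definition above."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["cid"], rec["hid"], rec["cat"])].append(rec)
    return {key: sorted(flow, key=lambda r: r["time"])
            for key, flow in groups.items()}
```

Each resulting flow is then serialized into a single-feature sequence (e.g., TD) and quantized before learning or testing.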
A. Experiments
Several experiments were conducted using the above datasets. Flows/sequences were randomly selected from the datasets, from both the 'Client' and 'Host' perspectives, and divided equally between the training and the testing phases (hence, ROCs are based only on newly seen data).

We first tested which single feature achieves the best results. The system was then optimized using this feature alone. The best results, in terms of optimal threshold and ROC-AUC (Area Under Curve), were achieved using the Time-Difference (TD) representation of the data sequences along with 'Uniform' quantization (several quantization algorithms were tested). To better understand why TD was superior, consider a legitimate web surfer compared to a hostile connection using HTTP only as a C&C channel. While the surfer must have a reasonable behavior in the time domain, affected by the times required to read a page, the times required for the server to respond, etc., a C&C channel may behave differently, without, for example, a reasonable response time from the server, as it only collects data from the Bots, and the "GET" messages are used solely to transmit information. Due to space limitations, we do not include the results for the inferior features, and focus on the results under TD and uniform quantization.

Fig. 3. Testing 'Majority Vote Classification' using the TD feature and the 'Uniform' quantization method, considering 'Clients' type of flows (AUC: 0.994, 0.990, 0.974). Left: Receiver Operating Characteristic curve; Right: zoom-in on the upper left corner. Each testing sequence is partitioned into several sets of subsequences, denoted #Subseq in the graph, and the decision is made per set of subsequences, where better results are achieved for a higher number of subsequences.
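The per-subsequence majority decision referred to in Figure 3 can be sketched as follows; `score` stands for any trained model's probability assignment, and the names and defaults are illustrative:

```python
def majority_vote(sequence, score, threshold, sub_len=10):
    """Partition a tested sequence into fixed-length subsequences, score
    each with the trained model's probability assignment, and decide by
    majority of the per-subsequence threshold decisions."""
    subs = [sequence[i:i + sub_len] for i in range(0, len(sequence), sub_len)]
    normal_votes = sum(1 for s in subs if score(s) >= threshold)
    return "normal" if normal_votes > len(subs) / 2 else "anomalous"
```

A single subsequence gives an immediate decision; voting over more subsequences trades decision delay for accuracy, as discussed below.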
Still, using TD, the optimal threshold for 100% detection results in 11.75% false alarms. However, this is when only a single, short sequence is tested. To further improve the above results, a majority vote over several sequences within the flow can be used. Each data segment is partitioned into several subsequences of length 10. The classification is done based on the majority of these subsequences' estimations, either positive or negative, resulting in better classification performance as the number of subsequences grows. For example, an AUC of 0.994 and a false alarm rate of 2.32378% are achieved using a threshold of 7.87557×10^−, as illustrated in Figure 3. This is obtained at the cost of a higher detection time per data segment, of course: using only one subsequence per data segment, the decision is made immediately, while using 9 subsequences incurs a delay.

The above results were obtained from a 'Client' point of view, considering semisupervised training. To examine the 'Host' point of view, and to differentiate between 'Semisupervised-Negative', where one considers only normal data sequences during training, 'Semisupervised-Positive', taking into account only anomalous sequences during training, and 'Unsupervised', where the training set consists of both normal and a few anomalous sequences, we refer the reader to Figure 4. Clearly, trying to learn the anomalous behavior fails (the red curve, AUC=0.219), as there are only a few samples, and C&C traffic may differ significantly for new Bots on which the system was not trained. The key message to take from the figure is, however, that when learning is done using noisy data, which includes some C&C traffic besides the normal one, there is no significant degradation in performance. That is, in 'Unsupervised' training mode, the classifier achieves very good results despite the fact that the underlying datasets used in the training phase contain both normal and a few anomalous sequences.
'Semisupervised-Negative' achieves the best results, with AUC=0.998. Note that 'Unsupervised' is the more realistic scenario, where no a priori information is available on the training data.

Fig. 4. Testing 'Training Modes': Semisupervised-Negative, Semisupervised-Positive and Unsupervised modes, using the TD feature and the 'Uniform' quantization method with respect to 'Hosts' type of flows. The classifier achieves the best results of AUC=0.998 with 100% detection and 3.51641% false alarms for the Semisupervised-Negative training mode, and the worst results of AUC=0.219 and ~98% false alarms for 100% detection in the case of the Semisupervised-Positive training mode.

Finally, for a concrete example, examining flows 6-3, 6-14 and 9-1 under TD, 'Uniform' quantization, and a threshold of 2.88783x10− (obtained from the last test case), we found that all these flows are classified as anomalies. From the clients' point of view, this indicates that clients 6 and 9 are infected by a Bot program, and from a host perspective this implies that hosts 1, 3 and 14 act as command and control servers. Examining these three servers against a list of the actual domains corresponding to the hosts confirmed our findings. Server 1 is known as 'hotsearchworld.com', Server 3 is 'blitzkrieg88.bl.funpic.de' and Server 14 has IP address 209.123.8.198, which was black-listed.

V. DETECTION OF MALICIOUS TOOLS AND DATA LEAKAGE
In this section, we include additional results which further strengthen the applicability of the suggested universal anomaly detection. Specifically, we apply the suggested anomaly detection system to system calls, in order to detect malicious tools on a Windows machine, and to the TCP traffic of a server, in order to detect unwanted data leakage. In both experiments, the capability of the tool to detect abnormal behaviour without prior knowledge is demonstrated.
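Since both experiments build on the LZ78 phrase tree of Section II, a minimal sketch of LZ78 incremental parsing may be helpful. The function name and the dictionary representation of the tree are our own; a real implementation would also track phrase counts to derive the probability assignment.

```python
def lz78_parse(sequence):
    """Incrementally parse a sequence into distinct LZ78 phrases.

    Each new phrase is the shortest prefix of the remaining input not
    seen before; the resulting dictionary is the phrase tree from which
    probability assignments are derived (counts omitted in this sketch).
    """
    phrases = {(): 0}  # root of the phrase tree
    current = ()
    for symbol in sequence:
        candidate = current + (symbol,)
        if candidate in phrases:
            current = candidate                # keep extending a known phrase
        else:
            phrases[candidate] = len(phrases)  # new leaf in the tree
            current = ()                       # restart from the root
    return phrases
```

For example, parsing "aababc" yields the phrases a, ab, abc, i.e., a tree with four nodes including the root.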
A. Monitoring the Context of System Calls for Anomalous Behaviour
The sequence of system calls used by a process can serve as an identifier for the process behaviour and use of resources. Moreover, when a program is exploited or malicious tools are running, the sequence of system calls may differ significantly compared to normal behaviour, incriminating the program or the entire machine (see, e.g., [25] and references therein).

In this part of the work, the universal anomaly detection tool was used to learn the context of normal system calls, and alert for anomalous behaviour. Specifically, the sequences of system calls created by a process (e.g., firefox.exe) were recorded, processed, and learned. Then, when viewing new data from the same process, the anomaly detection algorithm compared the processed new data to the learned model in order to decide whether the process is still benign, or was maliciously exploited by some tool.

Due to the large number of possible system calls, calls were grouped into 7 types, based on the nature of the call:
Device, Files, Memory, Process, Registry, Security and Synchronization. That is, unlike the time-difference data described in Section IV, herein the quantization process did not include any minimization of distances or a requirement for uniform probabilities, but, rather, labeled the calls based on their known functionality. Recording and classification used NtTrace [26].

In the learning phase, system calls were recorded, quantized according to the types above, and a discrete sequence over the alphabet of size 7 was created. The sequence was used to build the (normal behaviour) LZ tree, as described in Section II, from which a histogram for the probabilities of tuples of length 20 was calculated. This histogram was the only data saved from the learning phase. The learning phase included 4 hours of data.

For testing, segments of 2 minutes were recorded. For each segment, a histogram was calculated, similar to the learning phase (calculating probabilities for tuples of length 20 over an alphabet of size 7). In this part of the work, decisions were made based on the Kullback-Leibler divergence (the KL distance [27, Section 2.3]) between the normal histogram and the tested one.

Figure 5 plots the KL distance between the histogram from the learning phase and the histograms extracted during the testing phase.
The process tested was firefox.exe, and the two vertical thick lines mark the time when the tool "Zeus" was active. It is very clear that the context of the system calls changes dramatically when the tool is active, and that simple monitoring of the KL distances every few minutes is sufficient to detect a change in the system behaviour.
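The tuple-histogram and KL-distance step described above can be sketched as follows. The integer labels standing for the seven call types, and the eps-smoothing of tuples unseen in the reference histogram (needed because the KL distance is undefined on zero probabilities), are our own assumptions.

```python
import math
from collections import Counter

# Hypothetical labels: integers 0..6 stand for the seven call types
# (Device, Files, Memory, Process, Registry, Security, Synchronization).

def tuple_histogram(calls, k):
    """Empirical distribution of overlapping k-tuples of quantized calls."""
    counts = Counter(tuple(calls[i:i + k]) for i in range(len(calls) - k + 1))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kl_distance(p, q, eps=1e-9):
    """D(p || q) in bits; eps handles tuples unseen in q (our assumption)."""
    return sum(pv * math.log2(pv / q.get(t, eps)) for t, pv in p.items())
```

In the experiment the tuple length is 20; a shorter k is used here only to keep the example small.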
B. Identifying Data Leakage
In this part of the work, the universal anomaly detection algorithm was used in order to identify data leakage from a web server.

TABLE II
DATA LEAKAGE IDENTIFICATION.

      Ncat   Ncat    Normal  Normal  Normal  Normal  Normal  Normal
MSE   0.962  1.262   0.044   0.153   0.143   0.43    0.142   0.017
KL    2.05   17.163  1.353   1.228   2.026   4.12    2.121   1.396

TABLE III
DATA LEAKAGE IDENTIFICATION WITH ADDITIONAL DOWNLOADS.

        Normal  Normal + MB  Normal + MB  Normal + MB
Normal  0.906   0.843        0.583        0.72
Ncat    19.05   0.787        0.733        0.353

Specifically, the setting was as follows. In the learning phase (a period of a few days), benign traffic on a web server was recorded using Wireshark [28]. Similar to the previous examples, timing-based sequences were extracted, quantized and used in order to build an LZ tree. This LZ tree served as a model for normal data.

Then, using Ncat [29], a script was installed on the server. This script initiated downloads of large chunks of data from the server. Several periods, each 30 minutes long, of traffic which includes Ncat were recorded. For comparison, similar-length periods of traffic without Ncat were recorded as well. An LZ tree was built for each of the 30-minute datasets.

To identify data leakage, unlike the Botnets setting considered in Section IV, in this case we compared the joint distributions of k-tuples resulting from the LZ trees. That is, we used the distribution of k-tuples resulting from the LZ tree as an identifier for the data set, and calculated the distances between the distributions.

Table II includes the results. The table depicts the distances between the learned, normal data and the testing periods, two of which include data leakage using Ncat, and the rest of which do not. Two distance measures were used: Mean Square Error (MSE) and the KL distance. Under MSE, the leakage sessions clearly stand out compared to normal data. Results under the KL distance are less clear, especially in the first Ncat session, which included more normal data than the second.

Finally, to further challenge the algorithm, and see whether data leakage will also stand out when the normal communication includes (peaceful) massive downloads, the normal communication was augmented with benign downloads of various sizes. Table III depicts the results (under the KL distance).
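The MSE measure used in Tables II and III can be sketched as follows (a KL sketch would be analogous). Representing each dataset's k-tuple distribution as a dictionary, and averaging over the union of the two supports, are our own conventions for illustration.

```python
def mse_distance(p, q):
    """Mean square error between two k-tuple distributions.

    p and q map k-tuples to probabilities; tuples absent from one
    distribution are taken to have probability 0, and the squared
    differences are averaged over the union of supports (our convention).
    """
    support = set(p) | set(q)
    return sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in support) / len(support)
```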
It is clear that while Ncat stands out compared to normal traffic on the web server, it is almost indistinguishable when the normal traffic learned includes downloads of large files. This is expected, as Ncat uses a similar protocol, and the key differences in the timing are caused by file sizes. Hence, data leakage is clearly detected compared to normal surfing, yet it is indistinguishable when the server, in peaceful times, serves large downloads.

VI. DISCUSSION
In this section, we discuss some issues related to the problem in question. First, we discuss what an attacker, the Botmaster, can do in order to neutralize our solution, and we show that doing so is quite complex. Next, we present the infinite alphabet problem that exists when using the Lempel-Ziv compression algorithm as a probability assignment mechanism, and suggest another approach for that problem.

Fig. 5. KL distances between the learned histogram of normal behavior of firefox.exe and the histograms created every two minutes in the testing phase of the same process, as a function of time. The two gray vertical lines mark the time when "Zeus" was active.
A. What Can An Attacker Do?
Recall that the statistical model constructed using the Lempel-Ziv algorithm is based on previously observed sequences, mainly generated by legitimate sources (clients), and each newly seen sequence is assigned a probability based on that model. Sequences with probability equal to or higher than a predefined threshold are classified as normal, and otherwise are classified as anomalous.

Accordingly, the attacker may try to build the exact same model used by our suggested system, in order to generate illegal sequences under the disguise of legitimate ones. For that matter, the attacker must have access to the same database that was used to build the above statistical model. However, this sensitive information, in most cases, is protected and not available (that is, the service provider or organization implementing our system may use large amounts of legitimate traffic recorded at that organization for the learning process). Therefore, the attacker's strategy is to simulate that model, or only part of it, where no prior knowledge of the underlying probability distribution of the sources generating the data sequences is available, by one of the following possibilities.

To be more precise, and to quantify probabilities rigorously, we first define the following. Let the underlying alphabet be a binary alphabet A = {0, 1}, and let the length of the sequences be n. Assume legitimate sequences are generated i.i.d. according to an underlying probability P, which is unknown, and the attacker generates sequences i.i.d. according to a probability distribution Q, based on some estimate the attacker generated.

In this case, the attacker may use a trial and error strategy, randomly generating sequences according to the probability distribution Q. Accordingly, the question that arises is: what is the probability of accepting sequences that were generated according to Q?
To answer this question, we rely on the method of types [30], where the type P_X of a sequence X = (x_1, x_2, ..., x_n), x_i in A, is defined as the relative proportion of occurrences of each element of A in X (which is a probability mass function of X on A). For example, let A = {1, 2, 3} and X = 12123. Accordingly, the type P_X is P_X(1) = 2/5, P_X(2) = 2/5, P_X(3) = 1/5. The type class of P_X, denoted T(P_X), is the set of all sequences of length n and type P_X (for a more complete discussion, see [30]).

Under the above notation, the probability of the type class T(P) under the distribution Q^n is 2^{-nD(P||Q)} to first order in the exponent, and, more precisely,

(n+1)^{-|A|} 2^{-nD(P||Q)} <= Q^n(T(P)) <= 2^{-nD(P||Q)},

where D(P||Q) is the Kullback-Leibler divergence (which acts as an error exponent for that matter). Consequently, as long as the attacker does not know P, and uses an estimate Q != P, we have D(P||Q) > 0, hence the above probability Q^n(T(P)) decays exponentially as n grows to infinity. This means that, as we use longer sequences (in the testing phase), the attacker has less chance to bypass and neutralize our suggested solution with any estimate Q != P.

Another attacking approach which needs to be considered is as follows. The attacker manages to obtain a legitimate sequence (one that exists in the LZ78 phrase-tree) generated according to that P, e.g., by simulating/monitoring a legitimate HTTP connection and extracting the time differences from that session. First, the attacker may try to use it periodically, by sending an attack sequence with the same pattern as the above sequence. This method will fail, as repeating a single sequence over and over again, even if it is legitimate and was derived from P, will create a stream whose distribution is far from P. For example, consider a case where one takes a short sequence, say of unbiased coin tosses, and generates a long stream by repeating it.
Clearly, the resulting sequence will fail a test comparing it to an unbiased coin.

A more sophisticated approach is to generate new sequences based on the above available sequence, as presented in [31]. The basic idea is as follows. Given the above legitimate sequence, considered as a training sequence and denoted by X^m, where m is the length of the sequence, and a string of k purely random bits U^k, which are independent of X^m, the objective is to generate new sequence(s) of the same length or shorter (n <= m), denoted Y^n, with the same probability distribution as X^m but with minimum statistical dependency between these sequences. That is, the attacker tries to generate new sequences as if he had the generating source itself. To achieve this goal, a deterministic function phi(.), independent of the unknown source P, is employed, such that Y^n = phi(X^m, U^k), and minimum mutual information I(X^m; Y^n) is required in order to guarantee as weak a dependence as possible between the given training sequence and the resulting output sequences. However, from the results obtained, it follows that in order to faithfully represent the characteristics of the data, the input length m must be as large as possible, and the number of random bits k needed to guarantee low dependency between X^m and Y^n grows linearly with the output length n.

Consider now our problem domain, where the Botmaster generates new sequences according to the above model and updates its Bots with these sequences, using the C&C channels, in order to carry out the attack. On the one hand, for the case where n < m, the Botmaster must constantly produce and maintain these k random bits (to guarantee low statistical dependency), resulting in a high-complexity mechanism; on the other hand, for the case where n = m, the Botmaster needs to generate large sequences (to preserve the characteristics of the original data, and specifically P), which may make it difficult to send these sequences, for example, as email attachments.
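The exponential decay of the attacker's acceptance probability can be illustrated numerically; the particular values of P and Q below are ours, chosen only for illustration. For a binary source with P = (0.5, 0.5) and an attacker estimate Q = (0.6, 0.4), D(P||Q) = 0.5 log2(25/24), roughly 0.0294 bits, so the probability that the attacker's length-n sequence lands in the legitimate type class is at most 2^{-0.0294 n}.

```python
import math

def kl_bits(p, q):
    """D(P || Q) in bits for two distributions on the same finite alphabet."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative values only: P is the true (unknown) source distribution,
# Q the attacker's i.i.d. estimate.
P, Q = (0.5, 0.5), (0.6, 0.4)
d = kl_bits(P, Q)  # roughly 0.0294 bits

def acceptance_bound(n):
    """Upper bound 2^{-n D(P||Q)} on Q^n(T(P)): the chance the attacker's
    length-n sequence falls in the legitimate type class."""
    return 2 ** (-n * d)
```

Even for this nearly correct estimate, the bound drops below 10^-8 for sequences of length 1000, which is the quantitative content of the claim above.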
Note that one may suggest that instead of receiving the above sequences through the C&C channels, the Bots will generate them independently. This is disqualified due to both complexity (first obtaining a legal sequence and then generating new ones, while Bots should operate in as simple a manner as possible) and the requirement of a coordinated attack.

VII. CONCLUSIONS
In this work, we proposed a generic, universal anomaly detection framework. The proposed framework is based on universal compression and probability assignment, and it is able to build models for the learned data without any prior knowledge or model assumptions. The models can then be used to detect anomalous behavior and alert in cases of attacks.

Specifically, using universal probability assignment techniques based on the LZ78 algorithm, we were able to suggest a modeling system which does not require any prior knowledge of the normal behavior, yet learns its statistical model optimally, in the sense that it converges to the true probability assignment whenever the source is stationary and ergodic. Together with the optimal decision rule, based on the Neyman-Pearson criterion, the probability assignments result in robust and efficient detection mechanisms. Moreover, as the suggested technique is based on practical universal compression, it can be implemented with low complexity and minimal pre-processing overhead.

To prove their applicability and test their performance, we applied the key techniques of this framework to several problems in computer security. The first was detecting C&C channels of Botnets. We evaluated the system on real-world traces. In particular, we proposed using time differences between events in the network as the key feature, and showed how the context of such a simple feature, easily learned using the suggested algorithm, enables the detection of most Botnets in the data set with a negligible false alarm probability.

We continued with additional applications, such as monitoring system calls in order to detect malicious tools, and identifying data leakage. The results for these applications concurred with our main tests on C&C detection, confirming the applicability of the suggested framework to several detection problems. Clearly, additional applications can be suggested.
To name a few, we believe such tools can be used to identify abnormal behavior of users in computer systems, or abnormalities in large data networks, based on traffic patterns and communication partners of the tested nodes.

REFERENCES

[1] Q. Wang, Z. Chen, and C. Chen, "On the characteristics of the worm infection family tree," IEEE Transactions on Information Forensics and Security, vol. 7, no. 5, pp. 1614-1627, 2012.
[2] W. T. Strayer, D. Lapsely, R. Walsh, and C. Livadas, "Botnet detection based on network behavior," in Botnet Detection. Springer, 2008, pp. 1-24.
[3] G. Gu, J. Zhang, and W. Lee, "Botsniffer: Detecting botnet command and control channels in network traffic," 2008.
[4] S. Chang and T. E. Daniels, "P2P botnet detection using behavior clustering & statistical tests," in Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence. ACM, 2009, pp. 23-30.
[5] S.-K. Noh, J.-H. Oh, J.-S. Lee, B.-N. Noh, and H.-C. Jeong, "Detecting P2P botnets using a multi-phased flow model," in Digital Society, 2009. ICDS'09. Third International Conference on. IEEE, 2009, pp. 247-253.
[6] J. Francois, S. Wang, W. Bronzi, R. State, and T. Engel, "Botcloud: detecting botnets using mapreduce," in Information Forensics and Security (WIFS), 2011 IEEE International Workshop on. IEEE, 2011, pp. 1-6.
[7] R. Villamarin-Salomon and J. C. Brustoloni, "Identifying botnets using anomaly detection techniques applied to DNS traffic," in Consumer Communications and Networking Conference, 2008. CCNC 2008. 5th IEEE. IEEE, 2008, pp. 476-481.
[8] P. Burghouwt, M. Spruit, and H. Sips, "Towards detection of botnet communication through social media by monitoring user activity," in Information Systems Security. Springer, 2011, pp. 131-143.
[9] S. Shin, G. Gu, N. Reddy, and C. P. Lee, "A large-scale empirical study of Conficker," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 676-690, 2012.
[10] S. S. Silva, R. M. Silva, R. C. Pinto, and R. M. Salles, "Botnets: A survey," Computer Networks, vol. 57, no. 2, pp. 378-403, 2013.
[11] F. H. Abbasi, R. J. Harris, G. Moretti, A. Haider, and N. Anwar, "Classification of malicious network streams using honeynets," in Global Communications Conference (GLOBECOM), 2012 IEEE. IEEE, 2012, pp. 891-897.
[12] M. Celenk, T. Conley, J. Willis, and J. Graham, "Predictive network anomaly detection and visualization," IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 288-299, 2010.
[13] G. Gu, R. Perdisci, J. Zhang, W. Lee et al., "Botminer: Clustering analysis of network traffic for protocol- and structure-independent botnet detection," in USENIX Security Symposium, 2008, pp. 139-154.
[14] B. AsSadhan, J. M. Moura, D. Lapsley, C. Jones, and W. T. Strayer, "Detecting botnets using command and control traffic," in Network Computing and Applications, 2009. NCA 2009. Eighth IEEE International Symposium on. IEEE, 2009, pp. 156-162.
[15] F. Tegeler, X. Fu, G. Vigna, and C. Kruegel, "Botfinder: Finding bots in network traffic without deep packet inspection," in Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies. ACM, 2012, pp. 349-360.
[16] W. Lu, M. Tavallaee, G. Rammidi, and A. A. Ghorbani, "Botcop: An online botnet traffic classifier," in Communication Networks and Services Research Conference, 2009. CNSR'09. Seventh Annual. IEEE, 2009, pp. 70-77.
[17] A. Este, F. Gringoli, and L. Salgarelli, "Support vector machines for TCP traffic classification," Computer Networks, vol. 53, no. 14, pp. 2476-2490, 2009.
[18] C. Mazzariello and C. Sansone, "Anomaly-based detection of IRC botnets by means of one-class support vector classifiers," in Image Analysis and Processing - ICIAP 2009. Springer, 2009, pp. 883-892.
[19] S. Siboni and A. Cohen, "Botnet identification via universal anomaly detection," in Information Forensics and Security (WIFS), 2014 IEEE International Workshop on, Dec 2014, pp. 101-106.
[20] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[21] M. Feder, N. Merhav, and M. Gutman, "Universal prediction of individual sequences," IEEE Transactions on Information Theory, vol. 38, no. 4, pp. 1258-1270, 1992.
[22] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530-536, 1978.
[23] M. Nisenson, I. Yariv, R. El-Yaniv, and R. Meir, "Towards behaviometric security systems: Learning to identify a typist," in Knowledge Discovery in Databases: PKDD 2003. Springer, 2003, pp. 363-374.
[24] R. Begleiter, R. El-Yaniv, and G. Yona, "On prediction using variable order markov models," J. Artif. Intell. Res. (JAIR), vol. 22, pp. 385-421, 2004.
[25] D. S. Fava, S. R. Byers, and S. J. Yang, "Projecting cyberattacks through variable-length markov models," IEEE Transactions on Information Forensics and Security.
[26] NtTrace.
[27] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[28] Wireshark.
[29] Ncat.
[30] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[31] N. Merhav and M. J. Weinberger, "On universal simulation of information sources using training data."