Universal Anomaly Detection: Algorithms and Applications
Shachar Siboni and Asaf Cohen,
Member, IEEE
Abstract—Modern computer threats are far more complicated than those seen in the past. They are constantly evolving, altering their appearance, perpetually changing disguise. Under such circumstances, detecting known threats, a fortiori zero-day attacks, requires new tools, which are able to capture the essence of their behavior, rather than some fixed signatures.

In this work, we propose novel universal anomaly detection algorithms, which are able to learn the normal behavior of systems and alert for abnormalities, without any prior knowledge of the system model, nor any knowledge of the characteristics of the attack. The suggested method utilizes the Lempel-Ziv universal compression algorithm in order to optimally give probability assignments for normal behavior (during learning), then estimates the likelihood of new data (during operation) and classifies it accordingly.

The suggested technique is generic, and can be applied to different scenarios. Indeed, we apply it to key problems in computer security. The first is detecting Botnet Command and Control (C&C) channels. A Botnet is a logical network of compromised machines which are remotely controlled by an attacker using a C&C infrastructure, in order to perform malicious activities. We derive a detection algorithm based on timing data, which can be collected without deep inspection, from open as well as encrypted flows. We evaluate the algorithm on real-world network traces, showing how a universal, low-complexity C&C identification system can be built, with high detection rates and low false-alarm probabilities. Further applications include malicious tool detection via system-call monitoring and data leakage identification.
Index Terms—Computer Security; Anomaly Detection; Universal Compression; Probability Assignment; Individual Sequences; Botnets; Command and Control Channels; Malicious Tools; Data Leakage.
Parts of this work appeared at the Workshop on Information Forensics and Security, WIFS 2014, Atlanta, GA. S. Siboni and A. Cohen are with the Department of Communication Systems Engineering, Ben-Gurion University, Beer-Sheva, 84105, Israel. E-mails: [email protected]; [email protected]. Work supported by the Israeli Chief Scientist under the Kabarnit consortium.

I. INTRODUCTION

CYBER-ATTACKS are a disturbing security threat existing today in communication- and computer-based systems. They affect a wide range of domains, including electricity and water infrastructures, financial and capital markets, medicine and healthcare, the army, businesses, enterprises and universities around the world. The majority of massive cyber-attacks today are conducted by Botnets, including Distributed Denial-of-Service (DDoS) attacks, spamming, fraud and identity theft, etc.

A Botnet is a logical network of compromised machines, Bots, which are remotely controlled by a Botmaster using a Command and Control (C&C) infrastructure. The compromised machines can be any collection of vulnerable hosts, e.g., computers, mobile phones or tablets. Infection is via infected websites, file-sharing networks, email attachments, and more (see infection tree analysis in [1]). Once a host is infected and becomes a Bot, it is programmed to use a C&C channel for further downloads and updates, and awaits instructions from the Botmaster. It updates its data and operates upon receiving commands from the Botmaster (e.g., launch a DDoS attack).

The C&C channel plays a key role in a Botnet by operating as the communication means within the network. The Botmaster manages and controls its Bots using these C&C channels in order to perform malicious activities on selected targets. This way, the Bots act as a distributed attack platform on demand, coordinated by the Botmaster. However, since the C&C channels are the only way the Botmaster can communicate with its Bots, they can be considered the weakest link of a Botnet, as blocking them renders the Botnet useless. Accordingly, a main objective is to identify and block C&C activities before any real harm is caused.

In order to mask their activities and bypass defense mechanisms such as firewalls, Botnets use common communication protocols as their C&C, including IRC [2], [3], HTTP [3], Peer-to-Peer (P2P) [4], [5], [6] and DNS [7].
Recently, Botnets have also adopted social networks as the underlying C&C [8]. However, while it is tempting to develop protocol-specific methods to detect Botnets, attackers constantly improve their C&C infrastructures and develop new evasion capabilities, including changing signatures of the C&C traffic, employing encryption and obfuscation, and using domain generation [9] in order to deceive detection systems.

Current techniques for Botnet study and detection are based on honeynets, signature-based detection and anomaly detection models [10]. Honeynets act as traps in order to collect information about Bots and study their behavior [11]. Once the mechanism of the monitored Bots is exposed, it is possible to design a designated detection and blocking mechanism. Signature-based approaches rely on a signature database of notorious Botnets that were previously learned. However, signature-based techniques are prone to zero-day attacks and require a constant update of the signature database [10]. Anomaly-based detection techniques, on the other hand, aim to detect anomalies in network traffic or system behaviour, which may indicate the presence of malicious activities.

A basic assumption when using anomaly detection is that attacks differ from normal behavior. Thus, traffic analysis is used on both packet and flow levels, considering metrics such as rate, volume, latency, response time and timestamps in order to identify anomalous data. Indeed, anomaly detection seems a promising approach for Botnet detection, since it may detect new structures of attacks (zero-day attacks). However, this may come at the cost of high false-alarm rates. Moreover, to achieve good performance, one may require prior knowledge, e.g., statistical assumptions on the normal data, such as a Markov Model [5] or ARMA modeling [12].

Gu et al.
proposed two anomaly-based detection systems, BotSniffer [3] and BotMiner [13], based on traffic analysis. The former was designed for IRC- and HTTP-based Botnets, while the latter was designed as protocol-independent detection, which requires no prior knowledge. However, both systems rely on Deep Packet Inspection techniques, hence are less suitable for on-the-fly analysis of large amounts of traffic. AsSadhan et al. [14] suggested that periodic behavior indicates Botnet activity. Tegeler et al. [15] presented a detection system, BotFinder, considering high-level statistical features of C&C communication which were extracted from known Botnets in a controlled environment, limiting the ability to detect new types of attacks. Protocol-specific systems were given by Villamarín-Salomón and Brustoloni [7] for DNS traffic, Chang and Daniels [4] for P2P Botnet topology, and Strayer et al. [2] using correlation of packet sizes and timing patterns for IRC-based Botnets.

From a classification point of view, Lu et al. [16] presented a classification scheme, BotCop, using a decision-tree statistical model to classify different types of applications. Yet, BotCop applied payload signature techniques. Este et al. [17] and Mazzariello and Sansone [18] employed a Support Vector Machine as a single-class classifier model, which constructs a statistical model based on a given training set in order to distinguish between normal and malicious activities.

The suggested system does not rely on memoryless features of the data, such as specific values or signatures. In contrast, it builds a context tree for the learned data; hence, when a new data sequence is tested, the order of values or events in it has the main impact on the classification performance.

A. Our Contribution
We study the problem of detecting Botnet C&C channels. We suggest a novel universal anomaly detection algorithm, which uses no a-priori information about either the Botnet traffic patterns or the normal behavior patterns, yet efficiently learns the normal behavior in order to generate a statistical model to which tested traffic can be compared.

Our classification model is based on the celebrated Lempel-Ziv algorithm, which is known as an optimal universal compression algorithm and hence a preferable universal prediction algorithm when applied to stationary and ergodic sources over finite alphabets. Using the probability assignment induced by the prediction algorithm, we rigorously define the statistical model which represents the normal behavior, and offer a mechanism to test new, unknown sequences using this model. Furthermore, we offer a new look at the way data is used in the classification process, offering the context of the data sequence as the key characteristic used in the classification.

We evaluate the model with real-world network traces, where timing data is the main tested feature. This allows us to be both protocol-independent and encryption-independent. Moreover, it allows us to suggest a system which is immune to various hiding techniques, especially when used with low-level features of the data, such as timings or sizes. The results clearly show that the suggested model is a favorable solution for the problem at hand, with excellent results in terms of low false alarm and high detection rates, yet with only moderate (linear time) complexity and no deep packet inspection. Finally, we note that the suggested scheme is applicable to any sequence of behaviors, and not necessarily only timing data.
Hence, one can use it to test for anomalous application behavior, anomalous communication patterns within an organization and outside it, etc.

A short conference version of this work appeared in [19]. The current paper includes algorithms and results for two new applications, together with additional explanations and discussions.

The rest of the paper is organized as follows. Section II gives the required background material. Section III describes the key concept of universal anomaly detection, with the key application,
Botnet Identification, as the main example. Section IV gives the test results for this case, using real network traces. Section V gives two additional applications, together with results on real data. Section VI discusses the possible prevention strategies attackers can use against the suggested system, and proves its robustness by arguing that such strategies would require huge amounts of data and massive learning. Finally, Section VII concludes the paper.

II. PRELIMINARIES
Classification refers to the problem of labeling unknown (new) instances with the most appropriate class among a set of (known) predefined classes. When the underlying probability distributions for the classes $\{p_i\}_{i=1}^{M}$ are known, and we wish to decide which generated a given sequence $y$, a decision rule of the form $\hat{i} = \arg\max_{1 \le i \le M} p_i(y)$ is optimal in the sense of minimizing the probability of error. In unary-class classification, however, information is available only on one type of instances. The goal may be to either identify such instances, or, in the case of anomaly detection, identify instances which do not fit the ones learned from. Indeed, when only few, if any at all, anomalous instances exist to learn from, yet instances of normal behavior are available, one can build a behavioral model based on the normal instances and classify any instance deviating from that model as anomalous [20].

Thus, given the probability distribution of the normal data, $p(\cdot)$, the optimal decision rule in terms of maximizing the detection probability given a fixed false alarm probability (in the Neyman-Pearson sense) is to compare $p(y)$ to a threshold, and decide that $y$ is normal if $p(y)$ is above the threshold and anomalous otherwise. The threshold is determined according to the required false alarm probability. In practice, the underlying distribution which generates the normal sequences is, of course, unknown. A reasonable approach in this case is to estimate it using the previously observed sequences and use the resulting estimate $\hat{p}(\cdot)$. Note, however, that the estimation problem differs significantly if a statistical model (e.g., i.i.d. or Markovian of a certain order) is given, if the only knowledge is that the sequences are related to some stationary and ergodic source, or, in the "worst" case, the data consists of individual sequences, that is, deterministic sequences with no pre-defined statistical model.

In this paper, we suggest an anomaly detection technique for the most general case, where no underlying statistical model is given. To do this, we build on the relation between prediction of discrete sequences and lossless compression [21], in order to use universal compression algorithms and their associated probability assignment in the anomaly detection procedure.

A. Universal Probability Assignment
The Lempel-Ziv algorithm [22], LZ78, is a universal compression algorithm with vanishing redundancy. Consequently, it can also be used as an optimal universal prediction algorithm [21], using the appropriate probability assignment. The LZ78 algorithm is widely used in a variety of other applications. In the context of classification, it was also used in [23] for typist identification based on keyboard events, and in [24] for English text, music pieces and protein classification. For completeness, we briefly describe the compression method and the associated probability assignment algorithm.

The LZ78 algorithm is a dictionary-based compression method. For a given sequence of data symbols, a dictionary of phrases parsed from that sequence is constructed based on the incremental parsing process, as follows. At the beginning, the dictionary is empty. Then, during each step of the algorithm, the smallest prefix of consecutive data symbols not yet seen, i.e., which does not exist in the dictionary, is parsed and added to the dictionary. By that, each phrase is a unique phrase in the dictionary, which may extend a previously seen phrase by one symbol.

Given a sequence $s^n = (s_1 s_2 \ldots s_n)$, a parsed phrase, $P$, is the smallest prefix of consecutive data symbols that has not been seen yet. This can also be considered as the suffix concatenation of a symbol $s_i$ (from the sequence) with a previously seen phrase $P'$ (from the dictionary), i.e., $P = (P' s_i)$. A dictionary, $D$, is the collection of all distinct phrases parsed from a given data sequence $s^n$, i.e., $D = \{P_1, P_2, \ldots, P_i, \ldots, P_n\}$. For example, the sequence aabdbbacbbda is parsed as a|ab|d|b|ba|c|bb|da.

A common representation of the dictionary is a rooted tree, where each phrase in the dictionary is represented as a path from the root to an internal node in the tree, according to the set of symbols the phrase consists of. In addition, leaf-nodes are added as a suffix for each phrase in the tree.
A statistical model can be defined for a given data sequence during the construction of a phrase-tree [21], as described next.

At the beginning, an initial tree is constructed, including only a root node and k leaf-nodes as its children, where k is the size of the alphabet. Then, for each new phrase parsed from a sequence, the tree is traversed, starting from the root, following the set of symbols the phrase consists of, and ending at the appropriate leaf-node. Once a leaf-node is reached, the tree is extended at this point by adding all the symbols from the alphabet as immediate children of that leaf, making it an internal node. In order to define a statistical model, each node in the tree, except for the root node, maintains a node traversal counter, where each leaf-node's counter is set to 1 and each internal node's counter equals the sum of its immediate children's counters.

For a probability assignment, as all leaf-nodes' counters are set to 1, they are assumed uniformly distributed with probability 1/i, where i is the total number of leaf-nodes. Each internal node's probability is defined as the sum of its immediate children's probabilities, which also equals the ratio between its counter and the current i. For example, Figure 1 demonstrates the resulting statistical model for the sequence "aabdbbacbbda". Each node in the tree is represented by the 3-tuple {symbol, counter, probability}. In addition, the probability of an edge is defined by dividing the nodes' probabilities. Note that the probabilities of edges connected directly to the root are equal to the appropriate root-child's counter divided by the total number of leaf-nodes, i, at each step of the algorithm. The probability of a phrase $P_i \in D$ is calculated by multiplying the probabilities of the edges along the path defined by the symbols of $P_i$. Moreover, note that for each phrase $P_i$ there exists a specific node in the tree whose probability represents the probability of that phrase.
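As a concrete illustration, the phrase-tree construction and probability assignment described above can be sketched in a few lines of Python. This is a minimal sketch for exposition (class and method names are ours), not the implementation used in the paper:

```python
from fractions import Fraction

class LZ78Model:
    """Minimal sketch of the LZ78 phrase tree described above.
    A node is a dict of children; a leaf is an empty dict."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        # Initial tree: a root whose children are one leaf per symbol.
        self.root = {a: {} for a in self.alphabet}

    def _count(self, node):
        # A leaf's counter is 1; an internal node's counter is the
        # sum of its immediate children's counters.
        return 1 if not node else sum(self._count(c) for c in node.values())

    def train(self, seq):
        """Incremental parsing: walk from the root; on reaching a leaf,
        extend it with one child per alphabet symbol and restart."""
        node = self.root
        for s in seq:
            node = node[s]
            if not node:                  # leaf reached: phrase ends here
                for a in self.alphabet:
                    node[a] = {}
                node = self.root

    def prob(self, seq):
        """Sequential probability assignment: multiply edge probabilities
        (child counter over parent counter), returning to the root
        whenever a leaf is consumed."""
        p, node = Fraction(1), self.root
        for s in seq:
            p *= Fraction(self._count(node[s]), self._count(node))
            node = node[s]
            if not node:
                node = self.root
        return p

if __name__ == "__main__":
    m = LZ78Model("abcd")
    m.train("aabdbbacbbda")              # parsed as a|ab|d|b|ba|c|bb|da
    print(m.prob("ba"), m.prob("bdca"))  # prints: 1/7 1/784
```

Running the sketch on the paper's example sequence reproduces the phrase probabilities discussed next: the root children end up with counters 7, 10, 4 and 7 (i = 28), so P(ba) = 10/28 × 4/10 = 1/7.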
For instance, from the example shown in Figure 1, it can be seen that $P(ba) = \frac{10}{28} \times \frac{4}{10} = \frac{1}{7}$. Considering a sequence $S$, if during the traversal a leaf-node is reached before all the symbols of $S$ are consumed, then the traversal returns to the root and continues until all the symbols of that sequence are consumed [23]. For example, the probability of the sequence "bdca", given the same statistical model above, is defined by the traversal Root → b → d → Root → c → a and is calculated as the product of the traversal probabilities:

$P(bdca \mid M_{aabdbbacbbda}) = \frac{10}{28} \times \frac{1}{10} \times \frac{4}{28} \times \frac{1}{4} = \frac{1}{784}$.

This stems from the conditional probability $\hat{P}(s_{t+1} \mid s^t)$, where $s_{t+1}$ is the next symbol after the (sub-)sequence $s^t$, which is calculated as the ratio between the counter of the node reached after $s_{t+1}$ and the counter of the node reached after $s^t$. We consider $s^t$ as the context of $s_{t+1}$ at time $t+1$.

III. ANOMALY DETECTION VIA UNIVERSAL PROBABILITY ASSIGNMENT
We now describe the building blocks of the anomaly detection system. Throughout, the system is described in the context of detecting anomalies in network traffic. Thus, in our problem domain, the data instances are discrete sequences of network traces. However, as previously mentioned, the proposed system is generic, and can easily be adapted to detect anomalies in any discrete sequence of events, with the proper preprocessing. In fact, two additional applications of the algorithms below are given in Section V.
A. Preprocessing
Fig. 1. An LZ78 statistical model for the sequence "aabdbbacbbda".

A data sequence is defined as a series of events from a flow between a specific client and a specific host. The i-th Network Event, denoted by $e_{i,xy}$, is a data transaction between client x and host y, and is defined by the tuple $e_{i,xy} = (t_i, tt_i, csb_i, scb_i, x, y)$, where $t_i$ is the time event $e_{i,xy}$ occurred; $tt_i$ is the duration of event $e_{i,xy}$; and $csb_i$ and $scb_i$ are the total number of bytes sent by client x to host y and by host y to client x, respectively. A Network Flow, denoted by $f_{xy}$, is a series of network events between client x and host y, sorted by their time of occurrence, $t_i$. That is, $f_{xy} = \{e_{1,xy}, e_{2,xy}, \ldots, e_{n,xy}\}$.

For actual learning and testing, it is not required to use all features (fields) in the data. As shown in the experimental results, good detection capabilities can be achieved even when focusing on a single feature. For example, timing data can be characterized by the difference between two consecutive events of the same flow, denoted Time-Difference (TD) and defined by $TD_{i,xy} = e_{i+1,xy}(t_{i+1}) - e_{i,xy}(t_i)$. A different perspective is the total time the event took, denoted Time-Taken (TT) and defined by $TT_{i,xy} = e_{i,xy}(tt_i)$. Similarly, one can focus only on sizes, e.g., Client-Server-Bytes (CSB) and Server-Client-Bytes (SCB), and respectively define $CSB_{i,xy} = e_{i,xy}(csb_i)$ and $SCB_{i,xy} = e_{i,xy}(scb_i)$.

Consequently, a single-feature data sequence is a serialization of one of the above features; e.g., with respect to Time-Difference, a sequence/flow is defined as: $f_{xy,TD} = \{e_{2,xy}(t_2) - e_{1,xy}(t_1), e_{3,xy}(t_3) - e_{2,xy}(t_2), \ldots, e_{n,xy}(t_n) - e_{n-1,xy}(t_{n-1})\}$.

The above procedure may result in a sequence over a very large alphabet (as, for example, times are given with very high precision). To reduce the range of values, quantization is performed.
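The single-feature serialization just described, followed by quantization to a finite alphabet, might look as follows. This is a sketch; the record type and centroid values are illustrative assumptions, not the paper's code:

```python
from collections import namedtuple

# Field names follow the event tuple e_i = (t, tt, csb, scb, x, y);
# the concrete record type here is an assumption for illustration.
Event = namedtuple("Event", "t tt csb scb x y")

def td_sequence(flow):
    """Serialize a flow (events sorted by time) into its Time-Difference
    feature: TD_i = t_{i+1} - t_i."""
    return [nxt.t - cur.t for cur, nxt in zip(flow, flow[1:])]

def quantize(values, centroids):
    """Map each value to the index of its nearest centroid, yielding a
    sequence over a finite alphabet of size k = len(centroids)."""
    return [min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values]

if __name__ == "__main__":
    flow = [Event(0.0, 1, 803, 360, 486, 52),
            Event(2.1, 1, 578, 507, 486, 52),
            Event(2.3, 1, 577, 505, 486, 52)]
    td = td_sequence(flow)                 # approximately [2.1, 0.2]
    print(quantize(td, [0.1, 1.0, 5.0]))   # prints: [1, 0]
```

The resulting index sequence is the discrete, finite-alphabet input that the LZ78 model consumes.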
For k quantization levels, a set of k centroids, $\{c_1, c_2, c_3, \ldots, c_k\}$, is used. The centroids are extracted from the available data during the training phase. Clearly, the number of centroids and the method for extracting them may affect the overall results. However, as seen in the experiments on real data, this fine-tuning is easily done during training.

B. Learning
The LZ78-based classification model is divided into a learning phase and a testing phase, as illustrated in Figure 2.
Fig. 2. A classification model based on the LZ78 prediction algorithm.
In the learning phase, an LZ78 statistical model is built based on a given training set of discrete (quantized) sequences over a finite alphabet, $S = \{S_1, S_2, \ldots, S_n\}$, using the mechanism explained in Section II-A. Of course, training is done only on normal, benign traffic. In the testing phase, first, each testing sequence is separately quantized using the same quantization method and the same set of centroids $\{c_1, c_2, \ldots, c_k\}$ which were extracted in the learning phase. Then, the probability of each suspected sequence (testing sequence) $T_j$ from a given testing set $T = \{T_1, T_2, \ldots, T_m\}$ is estimated based on the statistical model constructed during the learning phase, and classified with respect to a predefined threshold $Tr$. Specifically, the probability of each testing sequence, $p(T_j)$, is estimated using sequential probability assignment given the LZ78 statistical model built in the learning phase. Testing sequences for which $\hat{p}(T_j)$ is greater than or equal to $Tr$ are classified as normal (as they "fit" the model), while a lower-than-threshold value is classified as anomalous.

Accordingly, the performance of the classifier is measured by the false alarm and hit detection ratios, also known as the false positive rate (FPR), or Type 1 error, and the true positive rate (TPR), respectively, and demonstrated by a ROC (Receiver Operating Characteristic) curve. The false alarm ratio reflects the number of negative instances incorrectly classified as positive, in proportion to the total number of negatives in the test, whereas the hit detection ratio measures the proportion between the number of positive instances correctly classified and the total number of positives in the test.
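The threshold decision and the two ratios just defined can be sketched as follows, with anomalous instances taken as the positive class (function names are ours, for illustration):

```python
def classify(p_hat, threshold):
    """A tested sequence is normal iff its assigned probability
    meets the threshold."""
    return "normal" if p_hat >= threshold else "anomalous"

def fpr_tpr(scores, is_anomaly, threshold):
    """False-positive and true-positive rates for one threshold.
    'Positive' means flagged anomalous, i.e. an assigned probability
    below the threshold; is_anomaly holds the ground-truth labels."""
    tp = sum(1 for p, a in zip(scores, is_anomaly) if a and p < threshold)
    fp = sum(1 for p, a in zip(scores, is_anomaly) if not a and p < threshold)
    pos = sum(is_anomaly)
    neg = len(is_anomaly) - pos
    return fp / neg, tp / pos
```

Sweeping the threshold over the range of observed scores and plotting the resulting (FPR, TPR) pairs traces the ROC curve described in the text.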
The ROC curves are generated with respect to a set of thresholds, in our case in the range $[\min_j \hat{p}(T_j), \max_j \hat{p}(T_j)]$, where each threshold defines a specific confusion matrix and results in a specific point on the graph.

Note that the above classification model can be updated, either by extending the existing statistical model or by rebuilding it with new (training) data sequences, at any given time. This can be applied once the overall performance drops below a given threshold for a configurable interval of time (and not for each momentary decrease).

IV. BOTNETS IDENTIFICATION
The data set used contains high-level real-world network traces provided by an Internet security company (in order to maintain the confidentiality of the company's customers, the name of the company is withheld). The data set consists of 3,714,238 client-server transactions, taken during a time window of approximately 3 hours on a specific day in 2009. The whole data set is available at https://dl.dropboxusercontent.com/u/13592090/Botnet2009.rar. Each client, denoted by 'cid', may connect with several hosts (web-servers), denoted by 'hid'. In each transaction, data is sent both by the client and the host. This defines communication pairs CID HID. Each transaction is labeled as either a legal transaction, denoted by 'good', for normal data traffic generated by the client, or an illegal transaction, denoted by 'hostile', that is, Bots. Labeling was done by the security company's experts based on well-known black-lists. Note that these labels are not used during the classification process. They are used only in the validation phase. Table I exemplifies the structure of the data set. Note that each transaction is represented by a single record in the data set, which consists of the following fields: 'time', referring to the time the transaction took place; 'time-taken', the total time the transaction took; 'cs-bytes' and 'sc-bytes', representing the total bytes sent by the client/server(host) to the server(host)/client during the transaction, respectively; 'mime-type', denoting the Internet content type of the transaction, such as plain text, image, html page, application, etc.; 'cat', the category of the transaction, 'good' or 'hostile'; and the 'hid' and 'cid' fields, referring to the host index (Internet site, web-server) and client index, respectively. Again, to protect the identity of the company's customers, these indices were assigned arbitrarily.
However, some malicious sites are identified by their domain name, e.g., 'hotsearchworld.com' or 'blitzkrieg88.bl.funpic.de'.

Processing of the data included serialization and feature extraction: first, the given data set is split into a set of flows based on CID HID connections, as illustrated in Table I with respect to flows 486 52 and 9 49 (marked in gray). A 'Flow' is a sequence of related transactions of the same communication pair CID HID, sorted by time and with the same label, either 'good' or 'hostile'. In total, there are 19164 flows labeled 'good' and only 65 'hostile' flows (0.338%). This indicates the imbalance of the data set, where most of the transactions are legal and only a small fraction is illegal. However, this is characteristic of real network traffic behavior.

TABLE I
EXAMPLE FOR THE DATABASE STRUCTURE. EACH RECORD IN THE DATA SET REPRESENTS A DATA TRANSACTION BETWEEN A SPECIFIC CLIENT AND A SPECIFIC HOST/SERVER.

time      time-taken  cs-bytes  sc-bytes  mime-type                        cat      hid    cid
05:52:37  40          803       360       image/gif                        good     49     9
05:52:37  74          734       277       text/html                        good     102    15
05:52:37  27          578       507       image/gif                        good     52     486
05:52:37  27          578       507       image/gif                        good     75526  486
05:52:37  25          655       4196      image/jpeg                       good     52     4
05:52:37  25          655       4196      image/jpeg                       good     75526  4
05:52:37  26          577       505       image/gif                        good     52     486
05:52:37  26          577       505       image/gif                        good     75526  486
05:52:37  31          624       960       image/gif                        good     52     6
05:52:37  31          624       960       image/gif                        good     75526  6
05:52:37  1           812       22672     application/octet-stream         good     52     6
05:52:37  1           812       22672     application/octet-stream         good     75526  6
05:52:37  30          707       4368      image/jpeg                       good     52     2
05:52:37  30          707       4368      image/jpeg                       good     75526  2
05:52:37  28          667       2639      image/jpeg                       good     52     4
05:52:37  28          667       2639      image/jpeg                       good     75526  4
05:52:37  180         434       1451      text/html;%20charset=iso-8859-1  hostile  3      6
05:52:37  34          710       4270      image/jpeg                       good     52     2
05:52:37  34          710       4270      image/jpeg                       good     75526  2
05:52:37  69          697       334       text/css                         good     49     9

Next, selected features are extracted from each transaction, e.g., Time-Difference, Time-Taken, Server-Client-Bytes and Client-Server-Bytes. After quantization, the resulting sequences are the discrete-time, finite-alphabet sequences on which learning and testing were performed.

Note that flows from a single client may consist of both 'good' flows, reflecting legitimate data traffic generated by the client itself, as well as 'hostile' flows, which are generated by a Bot installed on the client. In contrast, a server (host) has only flows with the same label; that is, if a host was labeled as a C&C infrastructure, then all its transactions are considered malicious. The 'Client' and 'Host' definitions represent two different perspectives on the data set. On the one hand, one can examine the events occurring in the network from the client point of view, and on the other, from the host point of view, as will be shown in the following results.
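The flow-splitting step just described can be sketched as follows; records are represented here as plain dicts with the Table I field names (an illustrative assumption), and keying by (cid, hid, cat) mirrors the per-label flow definition:

```python
from collections import defaultdict

def split_into_flows(records):
    """Group transactions into flows keyed by (cid, hid, cat) and sort
    each flow by time, matching the 'Flow' definition above."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["cid"], rec["hid"], rec["cat"])].append(rec)
    return {key: sorted(flow, key=lambda r: r["time"])
            for key, flow in groups.items()}
```

Each resulting flow is then serialized into a single-feature sequence (e.g., TD) and quantized before learning or testing.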
A. Experiments
Several experiments were conducted using the above datasets. Flows/sequences were randomly selected from the datasets, from both the 'Client' and 'Host' perspectives, and divided equally between the training and the testing phases (hence, ROCs are based only on newly seen data).

We first tested which single feature achieves the best results. The system was then optimized using this feature alone. The best results, in terms of optimal threshold and ROC-AUC (Area Under Curve), were achieved using the Time-Difference (TD) representation of the data sequences along with 'Uniform' quantization (several quantization algorithms were tested). To better understand why TD was superior, consider a legitimate web surfer compared to a hostile connection using HTTP only as a C&C channel. While the surfer must have a reasonable behavior in the time domain, affected by the times required to read a page, the times required for the server to respond, etc., a C&C channel may behave differently, without, for example, a reasonable response time from the server, as it only collects data from the Bots, and the "GET" messages are used solely to transmit information. Due to space limitations, we do not include the results for the inferior features, and focus on the results under TD and uniform quantization.

Fig. 3. Testing 'Majority Vote Classification' using the TD feature and the 'Uniform' quantization method, considering 'Clients' type of flows (AUC: 0.994, 0.990, 0.974). Left: Receiver Operating Characteristic curve; Right: zoom-in on the upper left corner. Each testing sequence is partitioned into several sets of subsequences, denoted #Subseq in the graph, and the decision is made per set of subsequences, where better results are achieved for a higher number of subsequences.
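The per-subsequence majority decision referred to in Figure 3 can be sketched as follows; `score` stands for any trained model's probability assignment, and the names and defaults are illustrative:

```python
def majority_vote(sequence, score, threshold, sub_len=10):
    """Partition a tested sequence into fixed-length subsequences, score
    each with the trained model's probability assignment, and decide by
    majority of the per-subsequence threshold decisions."""
    subs = [sequence[i:i + sub_len] for i in range(0, len(sequence), sub_len)]
    normal_votes = sum(1 for s in subs if score(s) >= threshold)
    return "normal" if normal_votes > len(subs) / 2 else "anomalous"
```

A single subsequence gives an immediate decision; voting over more subsequences trades decision delay for accuracy, as discussed below.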
Still, using TD, the optimal threshold for 100% detection results in 11.75% false alarms. However, this is when only a single, short sequence is tested. To further improve the above results, a majority vote over several sequences within the flow can be used. Each data segment is partitioned into several subsequences of length 10. The classification is done based on the majority of these subsequences' estimations, either positive or negative, resulting in better classification performance as the number of subsequences grows. For example, an AUC of 0.994 and a false alarm rate of 2.32378% are achieved using a threshold of 7.87557×10^−, as illustrated in Figure 3. This is obtained at the cost of a higher detection time per data segment, of course: using only one subsequence per data segment, the decision is made immediately, while using 9 subsequences incurs a delay.

The above results were obtained from a 'Client' point of view, considering semisupervised training. To examine the 'Host' point of view, and to differentiate between 'Semisupervised-Negative', where one considers only normal data sequences during training, 'Semisupervised-Positive', taking into account only anomalous sequences during training, and 'Unsupervised', where the training set consists of both normal and a few anomalous sequences, we refer the reader to Figure 4. Clearly, trying to learn the anomalous behavior fails (the red curve, AUC=0.219), as there are only a few samples, and C&C traffic may differ significantly for new Bots on which the system was not trained. The key message to take from the figure is, however, that when learning is done using noisy data, which includes some C&C traffic besides the normal one, there is no significant degradation in performance. That is, in 'Unsupervised' training mode, the classifier achieves very good results despite the fact that the underlying datasets used in the training phase contain both normal and a few anomalous sequences.
'Semisupervised-Negative' achieves the best results, with AUC=0.998. Note that 'Unsupervised' is the more realistic scenario, where no a priori information is available on the training data.

Fig. 4. Testing 'Training Modes': Semisupervised-Negative, Semisupervised-Positive and Unsupervised modes, using the TD feature and the 'Uniform' quantization method with respect to 'Hosts' type of flows. The classifier achieves the best results of AUC=0.998 with 100% detection and 3.51641% false alarms for the Semisupervised-Negative training mode, and the worst results of AUC=0.219 and ~98% false alarms for 100% detection in the case of the Semisupervised-Positive training mode.

Finally, for a concrete example, examining flows 6-3, 6-14 and 9-1 under TD, 'Uniform' quantization, and a threshold of 2.88783x10− (obtained from the last test case), we found that all these flows are classified as anomalies. From the clients' point of view, this indicates that clients 6 and 9 are infected by a Bot program, and from a host perspective this implies that hosts 1, 3 and 14 act as command and control servers. Examining these three servers against a list of the actual domains corresponding to the hosts confirmed our findings. Server 1 is known as 'hotsearchworld.com', Server 3 is 'blitzkrieg88.bl.funpic.de' and Server 14 has IP address 209.123.8.198, which was black-listed.

V. DETECTION OF MALICIOUS TOOLS AND DATA LEAKAGE
In this section, we include additional results which further strengthen the applicability of the suggested universal anomaly detection. Specifically, we apply the suggested anomaly detection system to system calls, in order to detect malicious tools on a Windows machine, and to the TCP traffic of a server, in order to detect unwanted data leakage. In both experiments, the capability of the tool to detect abnormal behaviour without prior knowledge is demonstrated.
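Since both experiments build on the LZ78 phrase tree of Section II, a minimal sketch of LZ78 incremental parsing may be helpful. The function name and the dictionary representation of the tree are our own; a real implementation would also track phrase counts to derive the probability assignment.

```python
def lz78_parse(sequence):
    """Incrementally parse a sequence into distinct LZ78 phrases.

    Each new phrase is the shortest prefix of the remaining input not
    seen before; the resulting dictionary is the phrase tree from which
    probability assignments are derived (counts omitted in this sketch).
    """
    phrases = {(): 0}  # root of the phrase tree
    current = ()
    for symbol in sequence:
        candidate = current + (symbol,)
        if candidate in phrases:
            current = candidate                # keep extending a known phrase
        else:
            phrases[candidate] = len(phrases)  # new leaf in the tree
            current = ()                       # restart from the root
    return phrases
```

For example, parsing "aababc" yields the phrases a, ab, abc, i.e., a tree with four nodes including the root.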
A. Monitoring the Context of System Calls for Anomalous Behaviour
The sequence of system calls used by a process can serve as an identifier for the process behaviour and use of resources. Moreover, when a program is exploited or malicious tools are running, the sequence of system calls may differ significantly compared to normal behaviour, incriminating the program or the entire machine (see, e.g., [25] and references therein).

In this part of the work, the universal anomaly detection tool was used to learn the context of normal system calls, and alert for anomalous behaviour. Specifically, the sequences of system calls created by a process (e.g., firefox.exe) were recorded, processed, and learned. Then, when viewing new data from the same process, the anomaly detection algorithm compared the processed new data to the learned model in order to decide whether the process is still benign, or was maliciously exploited by some tool.

Due to the large number of possible system calls, calls were grouped into 7 types, based on the nature of the call:
Device, Files, Memory, Process, Registry, Security and Synchronization. That is, unlike the time-difference data described in Section IV, herein the quantization process did not include any minimization of distances or a requirement for uniform probabilities, but, rather, labeled the calls based on their known functionality. Recording and classification used NtTrace [26].

In the learning phase, system calls were recorded, quantized according to the types above, and a discrete sequence over the alphabet of size 7 was created. The sequence was used to build the (normal behaviour) LZ tree, as described in Section II, from which a histogram for the probabilities of tuples of length 20 was calculated. This histogram was the only data saved from the learning phase. The learning phase included 4 hours of data.

For testing, segments of 2 minutes were recorded. For each segment, a histogram was calculated, similar to the learning phase (calculating probabilities for tuples of length 20 over an alphabet of size 7). In this part of the work, decisions were made based on the Kullback-Leibler divergence (the KL distance [27, Section 2.3]) between the normal histogram and the tested one.

Figure 5 plots the KL distance between the histogram from the learning phase and the histograms extracted during the testing phase.
The process tested was firefox.exe, and the two vertical thick lines mark the time when the tool "Zeus" was active. It is very clear that the context of the system calls changes dramatically when the tool is active, and that simple monitoring of the KL distances every few minutes is sufficient to detect a change in the system behaviour.
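The tuple-histogram and KL-distance step described above can be sketched as follows. The integer labels standing for the seven call types, and the eps-smoothing of tuples unseen in the reference histogram (needed because the KL distance is undefined on zero probabilities), are our own assumptions.

```python
import math
from collections import Counter

# Hypothetical labels: integers 0..6 stand for the seven call types
# (Device, Files, Memory, Process, Registry, Security, Synchronization).

def tuple_histogram(calls, k):
    """Empirical distribution of overlapping k-tuples of quantized calls."""
    counts = Counter(tuple(calls[i:i + k]) for i in range(len(calls) - k + 1))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kl_distance(p, q, eps=1e-9):
    """D(p || q) in bits; eps handles tuples unseen in q (our assumption)."""
    return sum(pv * math.log2(pv / q.get(t, eps)) for t, pv in p.items())
```

In the experiment the tuple length is 20; a shorter k is used here only to keep the example small.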
B. Identifying Data Leakage
In this part of the work, the universal anomaly detection algorithm was used in order to identify data leakage from a web server.

TABLE II
DATA LEAKAGE IDENTIFICATION.

      Ncat   Ncat    Normal  Normal  Normal  Normal  Normal  Normal
MSE   0.962  1.262   0.044   0.153   0.143   0.43    0.142   0.017
KL    2.05   17.163  1.353   1.228   2.026   4.12    2.121   1.396

TABLE III
DATA LEAKAGE IDENTIFICATION WITH ADDITIONAL DOWNLOADS.

        Normal  Normal + MB  Normal + MB  Normal + MB
Normal  0.906   0.843        0.583        0.72
Ncat    19.05   0.787        0.733        0.353

Specifically, the setting was as follows. In the learning phase (a period of a few days), benign traffic on a web server was recorded using Wireshark [28]. Similar to the previous examples, timing-based sequences were extracted, quantized and used in order to build an LZ tree. This LZ tree served as a model for normal data.

Then, using Ncat [29], a script was installed on the server. This script initiated downloads of large chunks of data from the server. Several periods, each 30 minutes long, of traffic which includes Ncat were recorded. For comparison, similar-length periods of traffic without Ncat were recorded as well. An LZ tree was built for each of the 30-minute datasets.

To identify data leakage, unlike the Botnets setting considered in Section IV, in this case we compared the joint distributions of k-tuples resulting from the LZ trees. That is, we used the distribution of k-tuples resulting from the LZ tree as an identifier for the data set, and calculated the distances between the distributions.

Table II includes the results. The table depicts the distances between the learned, normal data and the testing periods, two of which include data leakage using Ncat, and the rest of which do not. Two distance measures were used: Mean Square Error (MSE) and the KL distance. Under MSE, the leakage sessions clearly stand out compared to normal data. Results under the KL distance are less clear, especially in the first Ncat session, which included more normal data than the second.

Finally, to further challenge the algorithm, and see whether data leakage will also stand out when the normal communication includes (peaceful) massive downloads, the normal communication was augmented with benign downloads of various sizes. Table III depicts the results (under the KL distance).
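The MSE measure used in Tables II and III can be sketched as follows (a KL sketch would be analogous). Representing each dataset's k-tuple distribution as a dictionary, and averaging over the union of the two supports, are our own conventions for illustration.

```python
def mse_distance(p, q):
    """Mean square error between two k-tuple distributions.

    p and q map k-tuples to probabilities; tuples absent from one
    distribution are taken to have probability 0, and the squared
    differences are averaged over the union of supports (our convention).
    """
    support = set(p) | set(q)
    return sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in support) / len(support)
```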
It is clear that while Ncat stands out compared to normal traffic on the web server, it is almost indistinguishable when the normal traffic learned includes downloads of large files. This is expected, as Ncat uses a similar protocol, and the key differences in the timing are caused by file sizes. Hence, data leakage is clearly detected compared to normal surfing, yet it is indistinguishable when the server, in peaceful times, serves large downloads.

VI. DISCUSSION
In this section, we discuss some issues related to the problem in question. First, we discuss what an attacker, the Botmaster, can do in order to neutralize our solution, and we show that doing so is quite complex. Next, we present the infinite alphabet problem that exists when using the Lempel-Ziv compression algorithm as a probability assignment mechanism, and suggest another approach for that problem.

Fig. 5. KL distances between the learned histogram of normal behavior of firefox.exe and the histograms created every two minutes in the testing phase of the same process, as a function of time. The two gray vertical lines mark the time when "Zeus" was active.
A. What Can An Attacker Do?
Recall that the statistical model constructed using the Lempel-Ziv algorithm is based on previously observed sequences, mainly generated by legitimate sources (clients), and each newly seen sequence is assigned a probability based on that model. Sequences with probability equal to or higher than a predefined threshold are classified as normal, and otherwise are classified as anomalous.

Accordingly, the attacker may try to build the exact same model used by our suggested system, in order to generate illegal sequences under the disguise of legitimate ones. For that matter, the attacker must have access to the same database that was used to build the above statistical model. However, this sensitive information, in most cases, is protected and not available (that is, the service provider or organization implementing our system may use large amounts of legitimate traffic recorded at that organization for the learning process). Therefore, the attacker's strategy is to simulate that model, or only part of it, where no prior knowledge of the underlying probability distribution of the sources generating the data sequences is available, by one of the following possibilities.

To be more precise, and to quantify probabilities rigorously, we first define the following. Let the underlying alphabet be a binary alphabet A = {0, 1}, and let the length of the sequences be n. Assume legitimate sequences are generated i.i.d. according to an underlying probability P, which is unknown, and the attacker generates sequences i.i.d. according to a probability distribution Q, based on some estimate the attacker generated.

In this case, the attacker may use a trial and error strategy, randomly generating sequences according to the probability distribution Q. Accordingly, the question that arises is: what is the probability of accepting sequences that were generated according to Q?
To answer this question, we rely on the method of types [30], where the type P_X of a sequence X = (x_1, x_2, ..., x_n), x_i in A, is defined as the relative proportion of occurrences of each element of A in X (which is a probability mass function of X on A). For example, let A = {1, 2, 3} and X = 12123. Accordingly, the type P_X is P_X(1) = 2/5, P_X(2) = 2/5, P_X(3) = 1/5. The type class of P_X, denoted T(P_X), is the set of all sequences of length n and type P_X (for a more complete discussion, see [30]).

Under the above notation, the probability of the type class T(P) under the distribution Q^n is 2^{-nD(P||Q)} to first order in the exponent, and, more precisely,

(n+1)^{-|A|} 2^{-nD(P||Q)} <= Q^n(T(P)) <= 2^{-nD(P||Q)},

where D(P||Q) is the Kullback-Leibler divergence (which acts as an error exponent for that matter). Consequently, as long as the attacker does not know P, and uses an estimate Q != P, we have D(P||Q) > 0, hence the above probability Q^n(T(P)) decays exponentially as n grows to infinity. This means that, as we use longer sequences (in the testing phase), the attacker has less chance to bypass and neutralize our suggested solution with any estimate Q != P.

Another attacking approach which needs to be considered is as follows. The attacker manages to obtain a legitimate sequence (one that exists in the LZ78 phrase-tree) generated according to that P, e.g., by simulating/monitoring a legitimate HTTP connection and extracting the time differences from that session. First, the attacker may try to use it periodically, by sending an attack sequence with the same pattern as the above sequence. This method will fail, as repeating a single sequence over and over again, even if it is legitimate and was derived from P, will create a stream whose distribution is far from P. For example, consider a case where one takes a short sequence, say of unbiased coin tosses, and generates a long stream by repeating it.
Clearly, the resulting sequence will fail a test comparing it to an unbiased coin.

A more sophisticated approach is to generate new sequences based on the above available sequence, as presented in [31]. The basic idea is as follows. Given the above legitimate sequence, considered as a training sequence and denoted by X^m, where m is the length of the sequence, and a string of k purely random bits U^k, which are independent of X^m, the objective is to generate new sequence(s) of the same length or shorter (n <= m), denoted Y^n, with the same probability distribution as X^m but with minimum statistical dependency between these sequences. That is, the attacker tries to generate new sequences as if he had the generating source itself. To achieve this goal, a deterministic function phi(.), independent of the unknown source P, is employed, such that Y^n = phi(X^m, U^k), and minimum mutual information I(X^m; Y^n) is required in order to guarantee as weak a dependence as possible between the given training sequence and the resulting output sequences. However, from the results obtained, it follows that in order to faithfully represent the characteristics of the data, the input length m must be as large as possible, and the number of random bits k needed to guarantee low dependency between X^m and Y^n grows linearly with the output length n.

Consider now our problem domain, where the Botmaster generates new sequences according to the above model and updates its Bots with these sequences, using the C&C channels, in order to carry out the attack. On the one hand, for the case where n < m, the Botmaster must constantly produce and maintain these k random bits (to guarantee low statistical dependency), resulting in a high-complexity mechanism; on the other hand, for the case where n = m, the Botmaster needs to generate large sequences (to preserve the characteristics of the original data, and specifically P), which may make it difficult to send these sequences, for example, as email attachments.
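The exponential decay of the attacker's acceptance probability can be illustrated numerically; the particular values of P and Q below are ours, chosen only for illustration. For a binary source with P = (0.5, 0.5) and an attacker estimate Q = (0.6, 0.4), D(P||Q) = 0.5 log2(25/24), roughly 0.0294 bits, so the probability that the attacker's length-n sequence lands in the legitimate type class is at most 2^{-0.0294 n}.

```python
import math

def kl_bits(p, q):
    """D(P || Q) in bits for two distributions on the same finite alphabet."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative values only: P is the true (unknown) source distribution,
# Q the attacker's i.i.d. estimate.
P, Q = (0.5, 0.5), (0.6, 0.4)
d = kl_bits(P, Q)  # roughly 0.0294 bits

def acceptance_bound(n):
    """Upper bound 2^{-n D(P||Q)} on Q^n(T(P)): the chance the attacker's
    length-n sequence falls in the legitimate type class."""
    return 2 ** (-n * d)
```

Even for this nearly correct estimate, the bound drops below 10^-8 for sequences of length 1000, which is the quantitative content of the claim above.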
Note that one may suggest that instead of receiving the above sequences through the C&C channels, the Bots will generate them independently. This is disqualified due to both complexity (first obtaining a legal sequence and then generating new ones, while Bots should operate in as simple a manner as possible) and the requirement of a coordinated attack.

VII. CONCLUSIONS
In this work, we proposed a generic, universal anomaly detection framework. The proposed framework is based on universal compression and probability assignment, and it is able to build models for the learned data without any prior knowledge or model assumptions. The models can then be used to detect anomalous behavior and alert in cases of attacks.

Specifically, using universal probability assignment techniques based on the LZ78 algorithm, we were able to suggest a modeling system which does not require any prior knowledge of the normal behavior, yet learns its statistical model optimally, in the sense that it converges to the true probability assignment whenever the source is stationary and ergodic. Together with the optimal decision rule, based on the Neyman-Pearson criterion, the probability assignments result in robust and efficient detection mechanisms. Moreover, as the suggested technique is based on practical universal compression, it can be implemented with low complexity and minimal pre-processing overhead.

To prove their applicability and test their performance, we applied the key techniques of this framework to several problems in computer security. The first was detecting C&C channels of Botnets. We evaluated the system on real-world traces. In particular, we proposed using time differences between events in the network as the key feature, and showed how the context of such a simple feature, easily learned using the suggested algorithm, enables the detection of most Botnets in the data set with a negligible false alarm probability.

We continued with additional applications, such as monitoring system calls in order to detect malicious tools, and identifying data leakage. The results for these applications concurred with our main tests on C&C detection, confirming the applicability of the suggested framework to several detection problems. Clearly, additional applications can be suggested.
To name a few, we believe such tools can be used to identify abnormal behavior of users in computer systems, or abnormalities in large data networks, based on traffic patterns and communication partners of the tested nodes.

REFERENCES

[1] Q. Wang, Z. Chen, and C. Chen, "On the characteristics of the worm infection family tree," IEEE Transactions on Information Forensics and Security, vol. 7, no. 5, pp. 1614-1627, 2012.
[2] W. T. Strayer, D. Lapsely, R. Walsh, and C. Livadas, "Botnet detection based on network behavior," in Botnet Detection. Springer, 2008, pp. 1-24.
[3] G. Gu, J. Zhang, and W. Lee, "Botsniffer: Detecting botnet command and control channels in network traffic," 2008.
[4] S. Chang and T. E. Daniels, "P2P botnet detection using behavior clustering & statistical tests," in Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence. ACM, 2009, pp. 23-30.
[5] S.-K. Noh, J.-H. Oh, J.-S. Lee, B.-N. Noh, and H.-C. Jeong, "Detecting P2P botnets using a multi-phased flow model," in Digital Society, 2009. ICDS'09. Third International Conference on. IEEE, 2009, pp. 247-253.
[6] J. Francois, S. Wang, W. Bronzi, R. State, and T. Engel, "Botcloud: detecting botnets using mapreduce," in Information Forensics and Security (WIFS), 2011 IEEE International Workshop on. IEEE, 2011, pp. 1-6.
[7] R. Villamarin-Salomon and J. C. Brustoloni, "Identifying botnets using anomaly detection techniques applied to DNS traffic," in Consumer Communications and Networking Conference, 2008. CCNC 2008. 5th IEEE. IEEE, 2008, pp. 476-481.
[8] P. Burghouwt, M. Spruit, and H. Sips, "Towards detection of botnet communication through social media by monitoring user activity," in Information Systems Security. Springer, 2011, pp. 131-143.
[9] S. Shin, G. Gu, N. Reddy, and C. P. Lee, "A large-scale empirical study of Conficker," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 676-690, 2012.
[10] S. S. Silva, R. M. Silva, R. C. Pinto, and R. M. Salles, "Botnets: A survey," Computer Networks, vol. 57, no. 2, pp. 378-403, 2013.
[11] F. H. Abbasi, R. J. Harris, G. Moretti, A. Haider, and N. Anwar, "Classification of malicious network streams using honeynets," in Global Communications Conference (GLOBECOM), 2012 IEEE. IEEE, 2012, pp. 891-897.
[12] M. Celenk, T. Conley, J. Willis, and J. Graham, "Predictive network anomaly detection and visualization," IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 288-299, 2010.
[13] G. Gu, R. Perdisci, J. Zhang, W. Lee et al., "Botminer: Clustering analysis of network traffic for protocol- and structure-independent botnet detection," in USENIX Security Symposium, 2008, pp. 139-154.
[14] B. AsSadhan, J. M. Moura, D. Lapsley, C. Jones, and W. T. Strayer, "Detecting botnets using command and control traffic," in Network Computing and Applications, 2009. NCA 2009. Eighth IEEE International Symposium on. IEEE, 2009, pp. 156-162.
[15] F. Tegeler, X. Fu, G. Vigna, and C. Kruegel, "Botfinder: Finding bots in network traffic without deep packet inspection," in Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies. ACM, 2012, pp. 349-360.
[16] W. Lu, M. Tavallaee, G. Rammidi, and A. A. Ghorbani, "Botcop: An online botnet traffic classifier," in Communication Networks and Services Research Conference, 2009. CNSR'09. Seventh Annual. IEEE, 2009, pp. 70-77.
[17] A. Este, F. Gringoli, and L. Salgarelli, "Support vector machines for TCP traffic classification," Computer Networks, vol. 53, no. 14, pp. 2476-2490, 2009.
[18] C. Mazzariello and C. Sansone, "Anomaly-based detection of IRC botnets by means of one-class support vector classifiers," in Image Analysis and Processing - ICIAP 2009. Springer, 2009, pp. 883-892.
[19] S. Siboni and A. Cohen, "Botnet identification via universal anomaly detection," in Information Forensics and Security (WIFS), 2014 IEEE International Workshop on, Dec 2014, pp. 101-106.
[20] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[21] M. Feder, N. Merhav, and M. Gutman, "Universal prediction of individual sequences," IEEE Transactions on Information Theory, vol. 38, no. 4, pp. 1258-1270, 1992.
[22] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530-536, 1978.
[23] M. Nisenson, I. Yariv, R. El-Yaniv, and R. Meir, "Towards behaviometric security systems: Learning to identify a typist," in Knowledge Discovery in Databases: PKDD 2003. Springer, 2003, pp. 363-374.
[24] R. Begleiter, R. El-Yaniv, and G. Yona, "On prediction using variable order markov models," J. Artif. Intell. Res. (JAIR), vol. 22, pp. 385-421, 2004.
[25] D. S. Fava, S. R. Byers, and S. J. Yang, "Projecting cyberattacks through variable-length markov models," IEEE Transactions on Information Forensics and Security.
[26] NtTrace.
[27] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[28] Wireshark.
[29] Ncat.
[30] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[31] N. Merhav and M. J. Weinberger, "On universal simulation of information sources using training data."