All Infections are Not Created Equal: Time-Sensitive Prediction of Malware Generated Network Attacks
Zainab Abaid∗†, Dilip Sarkar‡, Mohamed Ali Kaafar†§ and Sanjay Jha∗
∗School of Computer Science and Engineering, University of New South Wales, Australia. {zainaba,sanjay}@cse.unsw.edu.au
†CSIRO Data61, Australia.
§Computing Department, Macquarie University. [email protected]
‡Computer Science Department, University of Miami. [email protected]
Abstract —Many techniques have been proposed for quickly detecting and containing malware-generated network attacks such as large-scale denial of service attacks; unfortunately, much damage is already done within the first few minutes of an attack, before it is identified and contained. There is a need for an early warning system that can predict attacks before they actually manifest, so that upcoming attacks can be prevented altogether by blocking the hosts that are likely to engage in attacks. However, blocking responses may disrupt legitimate processes on blocked hosts; in order to minimise user inconvenience, it is important to also foretell the time when the predicted attacks will occur, so that only the most urgent threats result in auto-blocking responses, while less urgent ones are first manually investigated. To this end, we identify a typical infection sequence followed by modern malware; modelling this sequence as a Markov chain and training it on real malicious traffic, we are able to identify behaviour most likely to lead to attacks and predict 98% of real-world spamming and port-scanning attacks before they occur. Moreover, using a semi-Markov chain model, we are able to foretell the time of upcoming attacks, a novel capability that allows accurately predicting the times of 97% of real-world malware attacks. Our work represents an important and timely step towards enabling flexible threat response models that minimise disruption to legitimate users.
Index Terms —Invasive software, Security and Protection, Markov processes, Botnet.
1 BACKGROUND AND MOTIVATION

Modern malware exploits the growing number of internet-connected devices and usually takes the form of botnets, large networks of infected machines jointly controlled by cyber criminals towards malicious ends. Frequently, machines compromised by botnet malware have been used to launch a variety of very large scale attacks on the internet. In recent years, increasingly large scale network attacks, such as distributed denial of service (DDoS) attacks [5] and point-of-sale credit card breaches [6], have been orchestrated using botnets. The potent danger of these attacks lies in the fact that they can cause significant losses immediately after occurring, for example financial loss to website owners subjected to DDoS attacks and to users whose private financial information is leaked. Since the dawn of the botnet phenomenon and over the last two decades, many techniques have been proposed to detect and mitigate infections and contain attacks. Unfortunately, however, much of the damage of an attack is done by the time it is detected and mitigated, even within the first few minutes. Thus, detecting an attack after it is launched is no longer sufficient given the scale and potential of attacks. In fact, there is a need for an early-warning system that can predict attacks before they occur, so that network administrators can take steps to prevent the predicted attacks from materialising. For example, hosts that are suspected to engage in attacks in the near future can be monitored, manually inspected or taken off the network altogether. While the safest response is to auto-block hosts that are likely to engage in attacks, this can lead to user inconvenience when legitimate processes are unnecessarily stopped. We argue that it is important to realise that infections do not all behave the same: attacks may occur at different times within different infections, and the urgency of the response should depend on the time available to respond to an attack.
Thus there is also a need for a time-prediction approach along with predicting the occurrence of attacks, so that the response to a prediction can be time-sensitive, and less urgent threats can first be manually investigated rather than blocking the infected host. We emphasise that we are dealing with prediction of outgoing attacks from infected hosts, such as sending of spam emails or sending out malware binaries, and not incoming attacks that cause a previously benign host to become infected with malware, for example a web exploit that delivers a malware binary to a previously clean host.

Publicly available intrusion detection systems (IDS) such as Bro [24], Snort [26] and Suricata [3], as well as a large volume of solutions from prior research, can detect the presence of botnets in network traffic with high accuracy, and detect botnet-generated attacks once they have been launched. In this work, we empirically evaluate whether this existing technology can be harnessed to go a step further and predict and estimate the time of upcoming botnet attacks; to the best of our knowledge, no prior research has addressed this question. Instead, existing work in this domain falls into three broad categories: first, solutions for identifying botnet infections or attack behaviour without attempting to predict the occurrence of this behaviour; second, prediction of a specific botnet-generated attack; and third, prediction of general intrusions without a focus on botnets. Solutions in the first category are able to identify a wide range of botnet infections and a broad spectrum of botnet-generated attacks, but cannot predict any malicious behaviour, including attacks, that may follow in future. Those in the second category are limited to predicting the specific attacks for which they are designed, such as spam or DDoS; they are not able to generally predict any attack from a botnet.
Conversely, solutions from the third category of intrusion prediction tend to be too general for our purpose (rather than botnet-focused), and their potential to predict botnet-generated attacks has not been evaluated. Thus, none of these solutions can enable botnet-focused prediction of attacks, and consequently cannot apprehend such attacks before they occur. This limitation has motivated us to design an early warning system that (a) works over existing, publicly available IDS tools, (b) can predict any botnet behaviour of interest, including attacks, (c) is not limited to specific attack or botnet types, and most importantly, (d) is sensitive to the temporal context within which attacks occur, and can estimate the time of occurrence of a predicted attack. We evaluate our solution empirically on a dataset of real botnet-generated traffic to address the question of whether existing IDS technology can indeed be successfully harnessed towards this goal of botnet-focused attack prediction.

Our approach is based on the intuition that botnet-infected hosts show observable behaviour other than just attacks. Early botnet studies [8] have identified a typical infection sequence (i.e. an ordered set of behavioural stages) followed by botnets, and traditional detection approaches [17] have effectively detected infections based on the occurrence of this infection sequence. In this work, we similarly monitor for the infection sequence, but move beyond traditional infection detection and instead focus on predicting future attacks based on the current behavioural context of a potentially infected host. Specifically, as infections usually start with incoming exploits (e.g. a drive-by download or a remote exploit of vulnerable services) resulting in bot malware download, then cause communication with a command-and-control (C&C) server, and finally manifest as outgoing attacks, it is clear that attacks occur within a context of distinct behavioural stages.
Thus, it is possible to fit a model to this context; training the model on real botnet traffic should allow identifying which behavioural stages commonly precede attacks and thus allow generating early attack warnings when those stages are observed. Moreover, capturing the typical temporal distribution of these different stages can allow predicting the times of occurrence of attacks given the observation history of an infected host.

We implement our approach by identifying the botnet infection sequence using earlier studies and an analysis of more recent, real-world botnets, and fitting various Markov models to this sequence. We train the models on two datasets comprising traffic from a variety of malware families, and propose a methodology to predict future behaviour of an infection based on past actions of a host. With a simple Markov chain model, we are able to predict attacks with 98% accuracy. Moreover, we show that our methodology can be used to predict other behaviour of interest, such as an infected host's communication with its C&C server. Secondly, we address the issue of how much history to incorporate into a future prediction. We identify that higher-order Markov chains allow incorporating a variable amount of history into a prediction decision, and empirically investigate whether more history is useful for attack prediction by designing a set of higher-order Markov chains and comparing their prediction accuracy with the first-order Markov chain. Furthermore, we build a semi-Markov chain (SMC) to enable a flexible and time-sensitive response to impending attacks, where less urgent threats can be handled differently from urgent threats. The SMC is able to learn a time-probability distribution from malware traffic, and in combination with the underlying Markov chain, is able to generate accurate time predictions for attacks as well as any other future behaviour of botnet-infected hosts.
With this model, we can make time predictions regarding when a particular behaviour will occur, with accuracy ranging from 94% to 100% for different kinds of behaviour. To the best of our knowledge, this time prediction is a novel capability not demonstrated in any existing malware detection research. In practice, we envision our solution being deployed as part of a flexible, time-sensitive threat response system which can selectively block hosts posing the most urgent threats and allow scheduling manual inspections of hosts posing less urgent risks. This would minimise user inconvenience resulting from unnecessary disruption of legitimate processes when hosts are blocked simply for being infected.

Overall, this paper makes the following contributions:

• An empirical validation of the botnet attack prediction potential of our proposed approach using real traffic from infected machines;
• An early warning system that can make predictions of future behaviour, including attacks, of botnet-infected hosts with 98% accuracy, and without the need for prior knowledge of infection;
• A model for the time distribution of botnet events with the novel capability of predicting times of occurrence of future malicious events, including attacks, with up to 100% accuracy;
• An investigation of whether considering greater historical context is advantageous for the botnet attack prediction problem.

The remainder of this paper is organised as follows. Section 2 discusses related work in this domain. Section 3 defines some key terminology. In Section 4 we design a model to capture the botnet infection sequence, and in Section 5 we instantiate different attack prediction models. Section 6 presents some empirical analysis of the data relevant to our prediction approach, and Section 7 outlines our experimental methodology and evaluation results. Section 8 discusses the practical feasibility of our approach, and Section 9 concludes the paper.
2 RELATED WORK
While identification of botnet infections and attacks has been a well-researched problem, a solution for predicting diverse botnet attacks while remaining independent of botnet and attack type has remained elusive. In addressing this gap, we have drawn inspiration from three broad categories of prior research: (a) identifying diverse botnet infections but without a focus on prediction, (b) predicting specific botnet attacks (i.e. solutions that are not generic to diverse attack types), and (c) intrusion prediction without a focus on botnets. We now discuss some key approaches belonging to each of these categories, review the use of Markov models similar to ours in prior intrusion detection research, and compare our work with this existing research.

In the first category of research, the idea of a bot "lifecycle", first proposed in [8], has been used to identify botnet infections by monitoring for the various behavioural stages that a botnet infection exhibits; infections are declared if multiple such stages are observed on a host in a particular sequence [16]–[18]. An example of this approach is found in BotHunter [17], where a host is classified as infected if it engages in communication sessions belonging to multiple "lifecycle" stages, such as malware downloading, C&C communication, and performing attacks. This approach forms an inspiration for our work as we too monitor hosts for signs of various bot lifecycle stages. However, instead of stopping at declaring that an infection has occurred, we go a step further and use the identified behaviour to predict what the future behaviour of the host is likely to be and whether (and when) it will engage in attacks.

In the second category of work on predicting specific attacks, only a limited amount of prior research exists. One scheme [19] uses the increase in traceroute packets in the network prior to a target link DDoS attack to predict the attack itself, detecting the preparation stage before the attack occurs.
This approach, however, is restricted to a specific attack. To the best of our knowledge, no current research has investigated a botnet-focused attack prediction solution not restricted to specific attack types.

Closest to our work is the third category of research on intrusion alert prediction, i.e. training a model on an observed sequence of alerts from an IDS and using it to predict future alerts [14], [20], [30]. The approach in [30] clusters intrusion alerts generated by the Snort IDS in training traffic, trains an HMM where the alert clusters are used as the observable symbols emitted by the hidden states, and calculates the most likely next alert category to be generated. While this category of work matches our goal of predicting future malicious activity, our work differs in its focus on botnet-generated attacks as opposed to general intrusion alerts. Our approach first maps alerts to known stages typical of botnet behaviour, discarding all alerts that do not signify botnet activity, which greatly reduces the amount of data to be processed by the system in both the training and deployment phases. To the best of our knowledge, ours is the first work to empirically evaluate the effectiveness of intrusion prediction in predicting botnet attacks.

Finally, our current work draws inspiration from a large body of prior research that has proposed various Markov models in the context of intrusion detection. These include single-order as well as higher-order Markov chains, semi-Markov chains (SMCs), and Hidden Markov Models (HMMs). A typical example of single-order Markov chains in intrusion detection is [34], which trains a Markov chain using events generated by an audit system on a host. In the deployment phase, event streams generated by hosts that are dissimilar to the learned benign model are classified as attacks.
Similar approaches are presented in [13], [35]. Recently, Markov chains have been applied to Android malware detection, where the sequence of API calls [23] or system service calls [27] in an application is used to build a Markov chain, which is then used as the feature vector for that application. A classification algorithm (e.g. Random Forest) is trained over Markov chains of malicious and benign applications in a training set and used to classify future applications as malicious or benign. This approach is inherently different from ours, as Markov chains are used as feature vectors for classifying applications rather than to predict the next event in a sequence, and hence cannot be used for attack prediction. HMMs have often been used for anomaly detection, where the anomalous behaviour to be detected is represented as a sequence of symbols, and the states of the model, which are not otherwise observable, are assumed to emit those symbols. The observed data is then checked for similarity to the trained model to detect anomalies. This approach has been applied to botnet detection [15], [22], [31], and recently also to program anomaly detection [33]. However, these approaches are restricted to identifying infections or malicious behaviour that has already started, rather than predicting the future behaviour of potentially infected hosts, which is the focus of our work.

One issue that arises when using Markov models for prediction of future behaviour is the amount of history to consider when making a prediction. Higher-order Markov chains [25], in which the last m states determine the next state, incorporate more history than single-order chains, and have been used in predicting the web browsing behaviour of users given their past observed history of web requests [11], [12]. Intrusion detection applications are less common; one such work [21] partially uses a higher-order Markov chain, along with other models, to differentiate between command sequences from legitimate system users and intruders.
Another issue in prediction is that one may be interested not just in the next transition but also in its time of occurrence. SMCs [9] address this by incorporating a time probability distribution in the Markov chain. While SMCs have not been used in predicting intrusions, an inspiration for our research is [32], which uses a hidden SMC to differentiate between legitimate and anomalous web-browsing behaviour (that could indicate DoS attacks) based on the requests received by web servers. To the best of our knowledge, neither higher-order Markov chains nor SMCs have been used in botnet attack prediction; part of our work addresses this gap by systematically performing such an investigation.

In summary, we believe that there is no existing approach, whether using Markov chains or otherwise, that evaluates whether botnet attacks, as well as their times of occurrence, can be accurately predicted while remaining independent of botnet or attack type.

3 SINGLE-ORDER, HIGHER-ORDER AND SEMI-MARKOV CHAINS
Markov chains are used to model systems that randomly move among a set of states. In general, a Markov chain is represented as a state transition diagram with a set of states [S], with the paths between the states weighted by the probability of moving from one state to another; each transition from a state S_i to a state S_j has a probability t_{i,j} of occurrence. We now discuss the types of Markov chain that we use for developing our attack prediction models. A single-order Markov chain is defined by a strict memoryless property, i.e. the next state depends only on the current state; neither older states nor any other variables, such as the time spent in the current state, have any effect on determining the next state. We briefly define some terminology that we later refer to in developing our single-order Markov chain model. An irreducible
Markov chain is one in which all states can communicate, i.e. for each pair of states S_i and S_j, there exists a path from S_i to S_j and vice versa. A state is periodic if it can only be returned to at time values that are multiples of an integer greater than 1, e.g. at times t = 3, 6, 9, and so on. The period of a state is calculated by taking the greatest common divisor (GCD) of all possible times of return to the state. States with a period of 1, such as those with a self-transition, are aperiodic, and a chain for which all states are aperiodic is an aperiodic chain. For an irreducible Markov chain, it is possible to define a stationary distribution representing the long-term probabilities of being in each of the states, i.e. P(S_i) for all states S_i as time grows large. This distribution should satisfy the properties P_j = \sum_{i \in [S]} P_i t_{i,j} and \sum_{i \in [S]} P_i = 1, where [S] is the set of states in the model, P_i is the stationary probability of being in state S_i, and t_{i,j} is the probability of going from state S_i to state S_j. If the chain is aperiodic, then this stationary distribution is also the limiting distribution for the Markov chain. Finally, a reversible Markov chain has to satisfy the property P_i t_{i,j} = P_j t_{j,i} for all pairs of states S_i and S_j, where P_i and t_{i,j} are as defined above.

In a single-order Markov chain, the probability of the next state depends only on the current state; an m-th order chain extends this dependence, such that the probability of the next state depends on the last m states up to and including the current state [25]. Higher-order Markov chains are useful in situations where the prediction of the next event requires more history than just the previous event, i.e. where different combinations of multiple events carry different influences on the future.
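To make the stationary-distribution definition concrete, the balance equations P_j = \sum_i P_i t_{i,j} can be solved numerically by power iteration for an irreducible, aperiodic chain. The sketch below is only illustrative: the four state names and all transition probabilities are hypothetical, not values from our trained models.

```python
# Hypothetical 4-state transition matrix t[i][j]; each row sums to 1.
STATES = ["Exploit", "Binary Download", "C&C Communication", "Attack"]
t = [
    [0.1, 0.5, 0.3, 0.1],
    [0.2, 0.1, 0.5, 0.2],
    [0.1, 0.2, 0.3, 0.4],
    [0.3, 0.3, 0.3, 0.1],
]

def stationary_distribution(t, iterations=1000):
    """Power iteration: repeatedly apply P <- P * t.
    For an irreducible, aperiodic chain this converges to the stationary
    distribution, which is also the limiting distribution."""
    n = len(t)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        p = [sum(p[i] * t[i][j] for i in range(n)) for j in range(n)]
    return p

p = stationary_distribution(t)
# The result satisfies P_j = sum_i P_i * t_{i,j} and sums to 1.
assert abs(sum(p) - 1.0) < 1e-9
print({s: round(pi, 3) for s, pi in zip(STATES, p)})
```

Since every row of t is a probability distribution, each iteration preserves the normalisation constraint, so only the balance equations need to converge.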
Therefore, higher-order chains appear to be a good fit for the botnet attack prediction problem. Intuitively, it seems possible that different combinations and sequences of botnet events may lead to attacks with different probabilities. For example, a certain botnet may behave as follows: when a host downloads a malicious binary, it first connects to the C&C server, and then downloads further executables containing attack instructions. Then it engages in an attack. Thus, attacks should be predicted when the sequence "Binary Download → C&C Communication → Binary Download" is observed, rather than after only the Binary Download event is observed. However, this distinction cannot be captured with a first-order Markov chain. Applying a higher-order Markov chain to the attack prediction problem allows us to investigate whether an improvement in attack prediction can be achieved by looking farther back into the past when making a prediction.
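As a sketch of this idea, an m-th order chain can be trained by counting transitions out of each length-m context of states. The training sequences below are fabricated solely to mirror the Binary Download → C&C → Binary Download example above; they are not drawn from our datasets.

```python
from collections import defaultdict

def train_mth_order(sequences, m):
    """Count transitions from each length-m context to the next state,
    then normalise the counts into conditional probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - m):
            context = tuple(seq[i:i + m])
            counts[context][seq[i + m]] += 1
    return {c: {s: n / sum(nxt.values()) for s, n in nxt.items()}
            for c, nxt in counts.items()}

# Fabricated sequences: attacks tend to follow BINARY -> CNC -> BINARY,
# but not every BINARY event on its own.
seqs = [
    ["BINARY", "CNC", "BINARY", "ATTACK"],
    ["BINARY", "CNC", "BINARY", "ATTACK"],
    ["BINARY", "CNC", "CNC", "BINARY", "ATTACK"],
    ["BINARY", "EXPLOIT", "CNC"],
]

first = train_mth_order(seqs, 1)
third = train_mth_order(seqs, 3)
# The first-order chain conflates all BINARY events into one context...
print(first[("BINARY",)])
# ...while the third-order chain isolates the tell-tale sequence.
print(third[("BINARY", "CNC", "BINARY")])   # {'ATTACK': 1.0}
```

Note that an m-th order chain is equivalent to a first-order chain whose states are the m-tuples of original states, which is how the higher-order models can reuse first-order machinery.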
Regardless of its order, a Markov chain ignores any effect of the time spent in the current state on determining the next state; we refer to the distribution of this time as the state holding time distribution. However, in the context of attack prediction, it is important not just to detect that the Attack state will occur, but also to estimate how much time will be spent in the current state before it does occur. The response to an attack warning can then be varied according to the amount of time available. For example, if an attack is predicted to occur within a minute of the warning, the best response may be to quarantine the potential attacker host immediately. However, if the attack is predicted to occur after an hour, manual verification of and response to the problem may be possible. The advantage of predicting the time of an attack would be to reduce the number of auto-quarantine responses to warnings, which are not desirable as completely denying network connectivity to infected hosts would pose a serious usability problem to legitimate users. Legitimate processes should be allowed to run while attacks are investigated and blocked. A semi-Markov chain [9] offers a solution for estimating the holding times of states.

A semi-Markov chain can be applied over a standard Markov chain. The holding times for the states of the embedded
Markov chain are assumed to be random variables obeying a distribution, which can be estimated from training data. Thus, a semi-Markov chain is defined primarily by an underlying Markov chain with a transition probability matrix T, and a time probability distribution F_{ij}(t) which describes the probability of each possible transition from S_i to S_j occurring within a varying time interval (0, t]. In a continuous-time semi-Markov chain, the time t is continuous and may be any real number, while a discrete-time semi-Markov chain divides the time into specific intervals, where transitions occurring within a particular interval (t_x, t_y] are all grouped together. In this work, we design a discrete-time semi-Markov chain, where it is possible to determine, for a given interval I_n, the probability of a transition from any state S_i to any state S_j within the interval I_n. Thus, when we predict an attack, we can also predict its approximate time of occurrence.

4 DESIGNING A MODEL TO CAPTURE MALWARE INFECTION SEQUENCES
In this section, we present a model for capturing the typical sequences observed within malware infections. Defining such a model is a prerequisite for the core contribution of our work, i.e. predicting attacks, as it represents the underlying state diagram of all three Markov chain models that we use for prediction.
Fig. 1: State transition diagram modelling botnet infection stages.
Prior studies [8], [17], [29] have suggested that bots follow a well-defined sequence, beginning with a social engineering attack or an exploit, which results in a malicious binary download, followed by communication with a C&C server and then the launch of an outgoing attack, such as spam generation or DDoS attacks. We examine the behavioural stages exhibited by a range of modern botnets, including recent mobile botnets, to verify that this sequence remains invariant, and conclude that outgoing attacks occur within a context that can be modelled. We capture this context as a state transition diagram, shown in Fig. 1, with each state representing a different stage of infection.
States of the Model:
Table 1 describes the behaviour each state in our model represents and exemplifies each stage with a recent banking trojan called Dridex, mapped to the relevant states of our model. We derive this information from an analysis of Dridex 120 published by CISCO [4]. Our model is one of many possible representations of botnet behaviour; for example, each of its states can be broken down into multiple states by considering different variations of a behavioural category. However, we restrict our model to eight states that condense all variations of each stage of behaviour (including attacks) into one high-level state. We believe that this model provides comprehensive coverage of the behaviour exhibited by a typical modern botnet infection.
Full Mesh Structure of the Model:
Unlike previous work, in which states are linked in a logical flow that the infection should proceed in, we link the states in a full mesh to avoid imposing our own assumptions about the infection sequence on the model. The sequence in which the stages occur can and does vary considerably between malware types. In fact, the botnet traces used in our experimentation show significant variation in the sequence of stages observed before attacks occur. The full mesh structure ensures that a compromised host does not have to follow a set sequence of stages for our attack prediction to work, as long as at least some of the stages can be detected on the host, and the training data contains a few different variations in the sequence.
Applying the comprehensive model of Fig. 1 entails being able to detect activities falling within each state of the model. To the best of our knowledge, this is not possible with existing tools. While tools such as Suricata [3], Bro [24] and Snort [26] are capable of detecting several such activities, at present there is no rule set or script publicly available with these tools that contains detection logic for all the behavioural stages in our model. Currently, we use Snort as our detection tool, with botnet detection rule sets from Emerging Threats [1] as well as from BotHunter [17], an industry-standard botnet detection IDS; this combination allows us to detect a greater variety of malicious activities than any other existing tool. We acknowledge that our model is IDS-dependent in its effectiveness. If the underlying IDS misses any of the malicious behaviour that a host engages in, this would compromise the attack prediction accuracy of our model. However, improving the IDS is beyond the scope of our work, which simply intends to demonstrate that applying our models over any existing IDS can extend its detection capability to prediction. We merge the Drive-by Download state into the Exploit state, and the C&C Discovery state into the C&C Communication state, as Snort does not always distinguish between these pairs. We remove the Social Engineering state, as detecting it entails being able to link a social media message or an email to a downloaded piece of malware, which is not possible with existing tools. We remove the Inbound Scan state as it is never detected within our current dataset. For the remainder of this paper, we work with a four-state model comprising the states
Exploit, Binary Download, C&C Communication, and Attack, linked in a full mesh. However, all our methodology is applicable as-is to the full model, including the calculations and the proofs of validity that follow in the next section.
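For illustration, once a chain over these states has been trained, an early attack warning can be raised whenever the trained probability of moving to the Attack state from the current state exceeds a chosen threshold. The dictionary representation, the state names and all probabilities below are hypothetical simplifications, not our trained models.

```python
def attack_warning(chain, current_state, threshold=0.5):
    """Return (warn, p), where p is the trained probability that the
    next observed stage is an Attack given the current stage.
    `chain` maps state -> {next_state: probability}."""
    p_attack = chain.get(current_state, {}).get("ATTACK", 0.0)
    return p_attack >= threshold, p_attack

# Hypothetical trained transition probabilities over the four-state model.
chain = {
    "EXPLOIT": {"BINARY": 0.8, "CNC": 0.2},
    "BINARY": {"CNC": 0.6, "ATTACK": 0.4},
    "CNC": {"ATTACK": 0.7, "CNC": 0.2, "BINARY": 0.1},
}

warn, p = attack_warning(chain, "CNC")
print(warn, p)   # True 0.7
```

The threshold is a deployment choice: a lower value favours early warnings at the cost of more false alarms, which the time predictions of the semi-Markov model can then help triage.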
5 TRAINING ATTACK PREDICTION MODELS
We now describe how we instantiated and trained three attack prediction models (a single-order Markov chain, a set of higher-order Markov chains, and a semi-Markov chain) on two malware datasets.
We used two traffic traces containing known malware activity to instantiate our model. Note that it is only for evaluation purposes that we use datasets of known infected hosts; in a practical deployment there is no need to pre-label the hosts as infected or clean.
The SysNet trace was generated by the SysNet lab [2] in July 2013 and contains traffic from ten infected hosts. It was generated by running ten bot binaries in separate virtual machines: four variants of Pushdo, two variants of Sality, and one each of Kolabc, Virut, Dorkbot, and Bobax. This covers HTTP, IRC, and P2P-based bots that engage in a range of attacks including sending spam and outbound scanning. Full packet traces from the virtual machines were captured for 24 hours using Wireshark on the host machines.
TABLE 1: Description of states of the Markov chain and mapping of Dridex 120 behaviour to each state (where applicable).
State | Description of Behaviour | Example of Dridex Behaviour

Inbound Scan | An incoming port scan on any port of the host, characterised by the same port being scanned on many hosts (horizontal scan) or many ports being scanned on one host (vertical scan). | N/A

Social Engineering | Human users are tricked into compromising on usual security practices, e.g. clicking on an intriguing (but malicious) spam link in an email message. | Victim receives an email with a malicious Microsoft Word or Excel attachment.

Drive-by Download | Malicious executables are downloaded without the user's consent during a web browsing session. | User opens the email attachment, triggering execution of an embedded macro, which then causes download of an intermediate dropper.

Exploit | Any attack targeting application or OS vulnerabilities for gaining "backdoor" access or remote execution privileges on the victim machine or web browser. | N/A

Binary Download | The actual download of a malicious piece of code or an executable, which may be disguised as legitimate software or as an image or document. | Dropper executes, and downloads and runs the actual bot binary.

C&C Discovery | When the downloaded binary is run and attempts to contact the botnet's C&C server. Discovery behaviour may include contacting a list of servers one by one over HTTP, opening multiple P2P connections, or a large number of DNS queries for a list of possible domains. | Sends an HTTP POST request to a specific hard-coded IP address.

C&C Communication | Exchanges with the discovered C&C server, over IRC, HTTP or, in some cases, custom protocols. | If the HTTP POST is successful, the bot receives configuration and instructions regarding which websites to target for redirection attacks.

Outgoing Attack | Outgoing traffic that can damage other entities, e.g. port scanning, information stealing, phishing, DoS or spam. | When the victim visits a banking website included in the configuration file, the bot can perform redirection to a phishing website and credential stealing.
TABLE 2: Example of sequences of behavioural stages observed in dataset.

IP | Event Sequence
a.a.a.a | 2:CNC, 3:CNC, 4:ATTACK, 7:EXPLOIT, 9:CNC, 13:BINARY, 16:EXPLOIT, 18:ATTACK, 23:CNC, ...
b.b.b.b | 0:EXPLOIT, 6:EXPLOIT, 7:BINARY, 19:ATTACK, 23:CNC, 26:BINARY, 27:ATTACK, 29:EXPLOIT, ...
c.c.c.c | 1:ATTACK, 2:ATTACK, 6:EXPLOIT, 7:BINARY, 9:CNC, 10:ATTACK, 11:CNC, 14:EXPLOIT, 18:BINARY, ...
This trace was collected by researchers at the ISCX Laboratory at UNB, Canada, and made publicly available. It contains full traces of 30 infected machines, nearly half of which are infected by IRC botnets and the remainder by a variety of P2P and HTTP botnets, including variants of Zeus, Virut, NSIS, Storm, and ZeroAccess, among others. Interested readers are referred to [10] for details of the trace. Unfortunately, not all hosts showed attack behaviour detectable by current intrusion detection tools; we therefore dropped such hosts and selected for inclusion in our experimentation only those hosts' traces that showed evidence of engaging in attacks. The duration of the selected traces varies from a few minutes to nearly 3 days. Because of the short duration of several of these traces, we found that this dataset by itself is unsuitable for splitting into training and testing portions for evaluating the model. Thus, we combined traces from both datasets into a single dataset comprising hosts from both the SysNet and the ISCX traces.

For training the Markov chain, the traces were mapped to state sequences comprising the states in our model, i.e. Exploit, Binary Download, C&C Communication and Attack. We ran Snort, which processes traffic and generates alerts when a rule is triggered, over our data using botnet detection rules from Emerging Threats and BotHunter; next we post-processed the Snort logs to identify alerts belonging to one of the states in our model, discarding irrelevant alerts. We then generated a timestamped sequence of states observed in each host's network traffic. A snapshot of the data used to instantiate our model is shown in Table 2. The timestamp represents minutes from the start of the trace; for example, "2:CNC" indicates that a C&C Communication event was detected at time 2 minutes from the start of the trace. We next describe how we use this data to train our prediction models.
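As an illustration, the post-processing step that turns IDS alerts into per-host state sequences can be sketched as follows. The record layout and state labels below are assumptions for illustration, not the exact format of our logs:

```python
from collections import defaultdict

# Hypothetical post-processed alert records: (minutes_from_start, host, state).
STATES = {"EXPLOIT", "BINARY", "CNC", "ATTACK"}

def build_state_sequences(alerts):
    """Group timestamped state labels into per-host event sequences,
    discarding alerts that do not map to a state of the model."""
    sequences = defaultdict(list)
    for minute, host, state in sorted(alerts):
        if state in STATES:
            sequences[host].append((minute, state))
    return dict(sequences)

alerts = [
    (2, "a.a.a.a", "CNC"), (3, "a.a.a.a", "CNC"), (4, "a.a.a.a", "ATTACK"),
    (0, "b.b.b.b", "EXPLOIT"), (5, "b.b.b.b", "IRRELEVANT"), (7, "b.b.b.b", "BINARY"),
]
seqs = build_state_sequences(alerts)
print(seqs["a.a.a.a"])  # [(2, 'CNC'), (3, 'CNC'), (4, 'ATTACK')]
```

The resulting per-host sequences have the same shape as the rows of Table 2.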
The first step in developing a single-order Markov chain is to instantiate our proposed model (from Section 4.2) with the state sequences extracted from the dataset and verify that it represents a valid Markov chain, according to the properties defined in Section 3.1. To this end, we first discuss the irreducibility and aperiodicity of our model, then calculate its transition and stationary probabilities to empirically verify that the stationary distribution is a valid Markov stationary distribution, and finally verify its reversibility.
Showing our Markov chain to be irreducible requires no proof. As discussed in Section 3.1, irreducibility simply means that all states can communicate with each other; the states in our model are connected in a full mesh, so this is indeed the case. Moreover, all states in our model have self-transitions, so the Markov chain is aperiodic. This allows us to label the stationary distribution as the limiting distribution for this chain, which is only valid for an aperiodic chain.
To empirically verify that the stationary probability distribution we obtain fulfils the conditions for a Markov stationary distribution outlined in Section 4.2, we need to derive the stationary probabilities representing the long-term likelihood of being in each state. To this end, we use the flow balance condition of a Markov model, i.e. the probability of leaving a state must equal that of entering it. This allows us to formulate a balance equation for each state of the model. In general, for each state S_i that has incoming transitions from a set of n states and outgoing transitions to a set of m states:

    Σ_{k=1..n} P_k t_{k,i} = Σ_{j=1..m} P_i t_{i,j}        (1)

where P_i is the limiting probability of being in state S_i, and t_{i,j} is the transition probability from state S_i to state S_j. Based on this, we define the following flow balance equations for the states S_1 to S_4 respectively:

    P_2 t_{2,1} + P_3 t_{3,1} + P_4 t_{4,1} = P_1 (t_{1,2} + t_{1,3} + t_{1,4})        (2)
    P_1 t_{1,2} + P_3 t_{3,2} + P_4 t_{4,2} = P_2 (t_{2,1} + t_{2,3} + t_{2,4})        (3)
    P_1 t_{1,3} + P_2 t_{2,3} + P_4 t_{4,3} = P_3 (t_{3,1} + t_{3,2} + t_{3,4})        (4)
    P_1 t_{1,4} + P_2 t_{2,4} + P_3 t_{3,4} = P_4 (t_{4,1} + t_{4,2} + t_{4,3})        (5)

where P_i (for i = 1..4) and t_{i,j} are as defined above.

In order to calculate the value P_i for each state S_i, we first need the transition probabilities t_{i,j}. Therefore we first generate a transition probability matrix T from the dataset of state sequences, with each cell [i, j] representing the probability of a transition between states S_i and S_j. A common approach to calculating the transition probability matrix is to build a training sequence of states by observing the system for some length of time, and then generate the probability t_{i,j} for each pair of states S_i and S_j as follows:

    t_{i,j} = N_{i,j} / Σ_{k ∈ [S]} N_{i,k}        (6)

where N_{i,j} is the number of transitions observed from state S_i to S_j, and Σ_{k ∈ [S]} N_{i,k} is the total number of transitions observed from state S_i to any state in [S].
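As an illustrative sketch (not our implementation), the count-based estimate of Equation 6 and a numerical solution of the balance equations can be written as:

```python
def transition_matrix(sequences, states):
    """Estimate t[i][j] = N_ij / sum_k N_ik (Equation 6) by counting
    transitions over a set of training state sequences."""
    idx = {s: n for n, s in enumerate(states)}
    counts = [[0] * len(states) for _ in states]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[idx[a]][idx[b]] += 1
    T = []
    for row in counts:
        total = sum(row)
        # Rows with no observed outgoing transitions stay all-zero.
        T.append([c / total if total else 0.0 for c in row])
    return T

def stationary_distribution(T, iters=100_000, tol=1e-12):
    """Approximate P satisfying the balance equations (1)-(5) plus
    conservation of total probability (sum_i P_i = 1), by repeated
    application of the transition matrix (power iteration)."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            break
        pi = nxt
    return pi

states = ["EXPLOIT", "BINARY", "CNC", "ATTACK"]
T = transition_matrix([["EXPLOIT", "BINARY", "CNC", "CNC", "ATTACK"]], states)
print(T[2])  # CNC row: [0.0, 0.0, 0.5, 0.5]

# Toy 2-state chain, not our dataset: pi solves pi_0 * 0.1 = pi_1 * 0.5.
pi = stationary_distribution([[0.9, 0.1], [0.5, 0.5]])
print([round(p, 4) for p in pi])  # [0.8333, 0.1667]
```

Power iteration converges to the unique stationary vector for an irreducible, aperiodic chain, which is exactly the setting established above.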
For our model and dataset, T is a 4 × 4 matrix and contains the following values:

T =
| 0.682  0.030  0.033    ·   |
| 0.035  0.426  0.527    ·   |
|   ·    0.001  0.926    ·   |
| 0.001    ·    0.099    ·   |

The states represented by each row and column are Exploit, Binary Download, C&C Communication and Attack, respectively. To solve for P_i, we use the values from the matrix T and replace Equation 5 with the conservation-of-total-probability equation Σ_{i ∈ S} P_i = 1. This allows us to obtain the stationary distribution for our dataset, shown in Table 3.

TABLE 3: Stationary probability distribution for the dataset showing long-term probability of being in each state.
State | Probability
Exploit | P_1 = 0.
Binary Download | P_2 = 0.
C&C Communication | P_3 = 0.
Attack | P_4 = 0.

The stationary distribution shows that C&C Communication is the activity that infected hosts tend to engage in most frequently. We verify empirically that the stationary probability distribution satisfies both conditions specified in Section 3.1, i.e. the conservation of total probability and the condition P_i = Σ_{j ∈ S} P_j t_{j,i}. The calculations are straightforward and we do not show them here.

Reversibility requires that for all states S_i, S_j: P_i t_{i,j} = P_j t_{j,i}. Using the calculated stationary probabilities and the matrix T, we verify empirically that this property holds for all states in the model. We omit the calculations for brevity.

We build a set of higher-order Markov chains up to order 9, and train each chain on our data as follows. For a given chain of order m, we learn a transition probability matrix T^(m); the difference from the order-1 matrix shown in Section 5.2 is that the rows no longer correspond to single states, but instead to ordered length-m sequences of the states in [S]. We refer to the sequence represented by row i as seq_i. Each cell t_{i,j} of the matrix is calculated as follows:

    t_{i,j} = N^(m)_{seq_i,j} / Σ_{k ∈ [S]} N^(m)_{seq_i,k}        (7)

where N^(m)_{seq_i,j} is the number of transitions from the sequence seq_i to a state j, and Σ_{k ∈ [S]} N^(m)_{seq_i,k} is the number of transitions from the sequence seq_i to any state in [S].

We apply a discrete-time semi-Markov chain over the same underlying Markov chain as defined in Section 5.2, and empirically estimate a holding-time distribution F_ij(t) as follows.
As the main objective of estimating time is to be able to respond to attacks differently depending on the length of time available, we divide the possible holding times into eight intervals representing various lengths of time available for taking preventative measures against predicted attacks. A network administrator receiving a warning that an attack will occur after a particular length of time can decide on a suitable preventative measure according to the available time. The set of intervals I is shown in Table 4. The first interval, (0, 1], is just a minute long; this is especially adapted to our dataset, where we notice a very high proportion of states occurring very close together (less than a minute apart). If an attack is predicted to occur within this interval, the only possible preventative action is to quickly, and perhaps automatically, block all traffic from the attacking host. The next six intervals are ten minutes each, and the eighth interval represents any length of time over an hour. Attacks predicted to occur within the eighth interval can almost certainly be manually investigated.

TABLE 4: Time intervals for Semi-Markov Chain, in minutes.

ID | Interval
1 | 0 < t ≤ 1
2 | 1 < t ≤ 10
3 | 10 < t ≤ 20
4 | 20 < t ≤ 30
5 | 30 < t ≤ 40
6 | 40 < t ≤ 50
7 | 50 < t ≤ 60
8 | t > 60

TABLE 5: Statistical properties of duration (in minutes) of self-transitions for each state.

State | Min. | Max. | Mode | Mean | Stdev.
Exploit | < 1 | | < 1 | |
Binary Download | < 1 | | < 1 | |
C&C Comm. | < 1 | | < 1 | |
Attack | < 1 | | < 1 | |

Having defined the set of intervals I, we represent F_ij(t) as a set of eight transition probability matrices, one pertaining to each time interval I_n, where 1 ≤ n ≤ 8. We then learn the eight matrices using an approach similar to learning the transition probability matrix discussed in Section 5.2. That is, for a given interval I_n, the element (i, j) of the matrix corresponding to I_n represents the probability of a transition from state S_i to state S_j within the interval I_n, and is calculated as follows:

    N_{i,j}(I_n) / Σ_{k ∈ [S]} N_{i,k}        (8)

where N_{i,j}(I_n) represents the number of transitions observed from state S_i to state S_j within the interval I_n, and Σ_{k ∈ [S]} N_{i,k} represents the total number of transitions from state S_i to any other state in any time interval.

After learning the distribution F_ij(t), the holding time for each state S_i can be estimated in terms of a probability for a given time value t_x as follows:

    H_{S_i}(t_x) = P{T_{S_i} ≤ t_x} = Σ_{j ∈ [S]} F_ij(t_x) P_ij        (9)

where T_{S_i} is the time spent in state S_i before any transition, F_ij(t_x) is the probability of the next transition from state S_i to state S_j occurring within time t_x (obtained from the learned F_ij(t) distribution after mapping t_x to one of the pre-defined intervals in [I]), and P_ij is the embedded transition probability from state S_i to S_j.

6 EMPIRICAL ANALYSIS OF SELF-TRANSITIONS
Before we detail the experimental methodology for attack prediction using our trained models, we discuss an important issue: whether or not to consider self-transitions in the data. Self-transitions arise when the underlying IDS generates multiple consecutive alerts for the same state; for example, a host continuously engaging in a port scan attack may generate a stream of scan alerts, which we interpret as
a sequence of self-transitions from the Attack state in our model. Table 7 shows an example of two consecutive port scan alerts. The rule [7] causing the alert is triggered when a threshold number of outgoing connections is crossed within a certain time window; in this case, if 70 connections are made in 60 seconds. Thus, every 60 seconds the rule will be triggered again, causing many consecutive alerts for the Attack state if the host engages in the activity for an extended period of time, for example several minutes. We observe that consecutive alerts for the same event occur frequently in our dataset; however, this could well be an artefact of using Snort as the IDS. A different IDS may generate only a single alert for the entire duration of a malicious activity. Regardless, we discuss the consequences of self-transitions below.

TABLE 6: Statistical properties of the number of self-transitions for each state.

State | Min | Max | Mode | Mean | Stdev
Exploit | | | | |
Binary Download | | | | |
C&C Comm. | | | | |
Attack | | | | |

TABLE 7: Example of Consecutive Attack Alerts

We find that when we build a transition matrix from the original dataset, as shown in Section 5.2, the self-transition probabilities for the C&C Communication and Attack states overwhelm the probabilities of transitions leading to other states. Thus, when we predict the most likely transition from these states, we end up always predicting the self-transition; in fact, the only time we can possibly predict attack is when the current state is the Attack state. This is not useful, as we need to know when a non-attack state will transition to attack, not whether the Attack state will remain in itself. Thus we have to consider whether it is possible to drop self-transitions by treating each set of consecutive alerts as a single alert. Although we are chiefly interested in predicting when a state changes (i.e. when a non-attack state transitions to the Attack state) as opposed to its duration (represented by how long it transitions to itself), it is still important to see whether there is a discernible pattern in the duration or number of each state's self-transitions.
Such a pattern, if it existed, would be important, because some non-attack states (say C&C Communication) may nearly always self-transition for a certain duration before they go to attack; this would allow us to generate a warning that an attack is likely to happen after that duration whenever we see the C&C Communication state. To this end, we perform an empirical analysis on our dataset.

Tables 5 and 6 show the statistical properties of the durations (in minutes) and number (i.e. number of consecutive alerts), respectively, of self-transitions of each state. We find that there is no predictable pattern in either the duration or the number of transitions. The tables show that the standard deviation of both the duration and the number of self-transitions is very high compared to the mean for almost all states, and the difference between the minimum and maximum is also generally very large. We performed a similar analysis for each individual host, not shown here owing to space constraints, but find that even within individual hosts infected by a single bot, the pattern is unpredictable, with values of both duration and number widely scattered. Table 5 shows that the minimum duration of each state is less than 1 minute, and that this is also the modal value for all states. This, combined with the fact that the actual duration is unpredictable owing to its high variance, leads us to conclude that whenever we predict an attack following another state, we should conservatively always predict that it is likely to follow in less than a minute. The actual duration after which it will follow then becomes irrelevant. The downside of this approach is that when an attack does not in reality follow within a few minutes, but rather after many hours, valuable resources will be wasted in monitoring the suspected host for a long time, or its communications will be unnecessarily restricted.
However, we find this acceptable because long durations of self-transitions are uncommon: among all occurrences of the Attack state in our dataset, only a small number last more than a few minutes, and the same holds for the C&C Communication state. Therefore, although self-transitions are part of our original model, we now decide to ignore all self-transitions in the data when training our model, and instead build our transition matrix from a dataset where we collapse every sequence of self-transitions into a single state, i.e. a number of consecutive alerts for a state will be considered a single alert. In the testing stage, when we make a prediction from a state, we ignore further alerts for the same state until there is a state change, at which point we make another prediction. Thus, this strategy of ignoring consecutive alerts works for both kinds of IDS: those that repeatedly generate a stream of alerts for the same instance of malicious activity (e.g. separate messages exchanged during a single connection with a C&C server) and those that group all consecutive alerts into a single alert for the activity (e.g. one alert per connection with a C&C server).

We now obtain a new transition probability matrix, which we call T′. As before, the states represented by each row and column are Exploit, Binary Download, C&C Communication and Attack, respectively. The values along the diagonal are now 0, as self-transitions have been removed. However, it can be verified that this model still satisfies all Markov properties discussed in Section 3.1. All states of the model communicate with each other; hence, the chain is irreducible. Each state is aperiodic, as it can be returned to at times 2, 3, 4 and so on; thus, the period, i.e. the GCD (greatest common divisor) of all possible return times, is 1.
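The collapsing of consecutive same-state alerts described above amounts to a simple run-length merge; a minimal sketch, with hypothetical timestamped events:

```python
def collapse_self_transitions(events):
    """Merge consecutive alerts for the same state into one event,
    keeping the timestamp of the first alert in each run."""
    collapsed = []
    for minute, state in events:
        # Start a new event only when the state actually changes.
        if not collapsed or collapsed[-1][1] != state:
            collapsed.append((minute, state))
    return collapsed

events = [(2, "CNC"), (3, "CNC"), (4, "ATTACK"), (9, "CNC"), (10, "CNC")]
print(collapse_self_transitions(events))
# [(2, 'CNC'), (4, 'ATTACK'), (9, 'CNC')]
```

The same logic applied at test time realises the "ignore further alerts until a state change" strategy.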
Finally, the properties of reversibility and a valid stationary distribution can be verified through calculations as discussed in Section 5.2; we omit the calculations for brevity.

7 EXPERIMENTAL METHODOLOGY AND EVALUATION RESULTS
We now investigate the accuracy with which we are able to predict attacks. Our prediction methodology is as follows. We first divided the dataset temporally into training and testing data: for each host, we considered an initial window of the data as training, and the remainder as testing data. We chose this rather short training interval because the data duration varies considerably across traces, with some traces lasting only a few hours; using longer training intervals leaves fewer hosts' traces available for testing the model. We trained and evaluated each model as follows.

To make predictions using the single-order Markov chain, we first learn a transition probability matrix from training data, similar to the matrix shown in Section 5.2, but with different values as it was built on less data. Then, given a set of states [S], the current state S_i, and a set of transition probabilities t_{i,j} for all j ∈ S, we predict the next state to be the state S_j such that the transition probability P(S_i → S_j) = max_{j ∈ S} t_{i,j}. This means that we predict the Attack state whenever it is the destination of the most likely outgoing transition from the current state. We iterate over the testing data, treating it as a real-time, previously unseen stream of events. After observing each state, we predict whether the next state to follow is likely to be the Attack state. Thus, whenever a new state is observed, we immediately make a prediction for the next state to follow.

We implement a set of chains from order 2 to order 9, learning a transition probability matrix T^(m) for each order m, which contains probabilities for all four states being preceded by all possible state sequences of length m. In the prediction phase, we iterate over the event stream, using m events to predict the next event. For example, when a new event E_t occurs at time t, we look up the row of the sequence [E_{t−m}, ..., E_t] in the matrix T^(m) and predict the most likely next event E_{t+1}.
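Both prediction rules reduce to an argmax lookup; a sketch with placeholder matrices (the values below are illustrative, not our learned probabilities):

```python
def predict_next(T, states, current):
    """Order-1 prediction: the most likely outgoing transition
    from the current state (argmax over that state's row of T)."""
    row = T[states.index(current)]
    return states[row.index(max(row))]

def predict_next_m(T_m, history):
    """Order-m prediction: T_m maps a length-m tuple of states to a
    distribution over next states; return the most likely next state."""
    dist = T_m[tuple(history)]
    return max(dist, key=dist.get)

states = ["EXPLOIT", "BINARY", "CNC", "ATTACK"]
T = [[0.0, 0.8, 0.1, 0.1],
     [0.0, 0.0, 0.9, 0.1],
     [0.1, 0.0, 0.0, 0.9],
     [0.2, 0.1, 0.7, 0.0]]
print(predict_next(T, states, "CNC"))  # ATTACK

T2 = {("BINARY", "CNC"): {"CNC": 0.3, "ATTACK": 0.7}}
print(predict_next_m(T2, ["BINARY", "CNC"]))  # ATTACK
```

With this rule, an attack warning is raised exactly when the Attack column holds the row maximum for the currently observed state (or state sequence).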
1. The attacks in our evaluation dataset are restricted to sending spam emails and performing outbound scans; while the current results only reflect these two activities, in theory our framework can predict any other attack, as the Attack state could represent one of many different attacks.

We learn eight transition probability matrices from the data (not shown for brevity), one for each of the eight time intervals in [I] shown in Table 4, representing the distribution F_ij(t). Each matrix T_{I_n} contains the transition probabilities among all states within the time interval I_n. In the testing phase, we iterate over the testing data, making a prediction as each new state occurs, simulating a real-time implementation. At each state change, we not only predict the next state, but also predict the holding time of the current state. We calculate the term H_{S_i}(t_x) from Equation 9 for all t_x ∈ [I], estimating a set of probabilities representing the likelihood of the holding time lying within each interval in [I]. We take the interval with the highest probability as the holding-time interval for the current state. As we make the prediction immediately after a state change is observed, the time so far spent in each state when we make the prediction is zero and is not taken into account. In our current implementation, if we predict, for example, that the holding time will lie in interval 2, it means that the next transition is predicted to occur after 1 to 10 minutes (refer to Table 4). We analyse the error in the holding-time prediction for each state in terms of the number of intervals that the prediction was off: if, for example, we predicted the holding time for a state to lie in interval 1, but the actual transition occurred in interval 3, the error will be PredictedValue − ActualValue = 1 − 3 = −2, indicating that our prediction was 2 intervals early. Positive prediction errors would indicate that we predicted a transition would occur later than it actually occurred.
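The interval mapping of Table 4 and the per-interval scoring used to pick the most likely holding-time interval can be sketched as follows. The function names and the toy F and P values are illustrative assumptions; in our system the matrices are learned from training data:

```python
# Upper bounds (minutes) of intervals 1..7 from Table 4; interval 8 is open-ended.
BOUNDS = [1, 10, 20, 30, 40, 50, 60]

def interval_id(t):
    """Map a holding time t (minutes) to its 1-based interval ID."""
    for n, upper in enumerate(BOUNDS, start=1):
        if t <= upper:
            return n
    return 8

def most_likely_interval(F, P, i):
    """Score each interval n by sum_j F[n][i][j] * P[i][j] (the
    per-interval terms behind Equation 9) and return the
    highest-scoring interval ID for state i."""
    scores = [sum(f_n[i][j] * P[i][j] for j in range(len(P))) for f_n in F]
    return scores.index(max(scores)) + 1

# Toy example with 2 states and 2 intervals.
P = [[0.3, 0.7], [1.0, 0.0]]          # embedded transition probabilities
F = [
    [[1.0, 0.2], [0.5, 0.0]],         # interval 1: P(i->j falls in interval 1)
    [[0.0, 0.8], [0.5, 0.0]],         # interval 2
]
print(interval_id(0.5), interval_id(25))  # 1 4
print(most_likely_interval(F, P, 0))      # 2
```

In the toy example, state 0's dominant transition (to state 1, probability 0.7) mostly falls in the second interval, so interval 2 wins the argmax.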
We now present the results for predicting attacks with the single-order Markov chain. The accuracy analysis of the experiment is shown in the first row of Table 8. We achieve a very high percentage (98.3%) of true positive predictions with a very low false alarm rate of 1.3%; i.e. only 1.3% of the time when we made an attack prediction did the attack not occur as the immediate next transition. In this experiment, we achieve an overall accuracy of 98.5%.

The earlier that predictions of future attacks can be made, the more useful they are. Fig. 2 plots, as a CDF, the frequency distribution of the time that elapsed after each correct attack prediction until the Attack state actually occurred. The x-axis is log-scaled to magnify the variation among small values. In the majority of cases, the warning time is less than 1 minute, i.e. after the majority of predictions, less than one minute elapses before an attack is seen. In the remaining cases, the warning time ranges from one minute upwards. While in an automated deployment a full minute is sufficient to take defensive measures against hosts soon to engage in attacks, warning times of only a few seconds may be insufficient to apprehend attacks. In such scenarios, our attack prediction does not meet the desired objective of having ample early warning to allow for selective defensive measures; we now investigate the possibility of increasing the warning time by making the prediction even earlier.

We now investigate whether it is possible to generate attack warnings significantly earlier by predicting states that are very likely to precede the Attack state. From the transition matrix T′ from Section 6, we can see that the C&C Communication state, represented by the third row, is the most likely to precede attacks. We hypothesise that predicting the C&C Communication state (instead of attack) would allow for earlier warning of attacks.
In the remainder of this section, we investigate (a) how accurately we can predict the C&C Communication state, (b) how often the predicted C&C Communication state actually precedes the Attack state, and (c) how much earlier a warning we can get for attacks when predicting C&C Communication instead of predicting the Attack state directly.

As before, we predict the next state to be C&C Communication when it is the most likely outgoing transition, achieving 99.8% true positives and 1.7% false positives (Table 8, row 2). Further analysis of the true positives shows that most of the time the C&C Communication state was predicted, it was indeed followed by the Attack state, demonstrating that it is a good indicator of future attacks. Finally, Fig. 3 plots the distribution of the time difference between predicting the C&C Communication state and the occurrence of the Attack state. It shows that the proportion of cases with under a minute of warning is substantially lower than when we were only predicting the Attack state. While we acknowledge that further improvement is needed, the results do validate our hypothesis that predicting further back in the state sequence can increase the warning time before an attack. We hypothesise that for even earlier warning, we can predict the state that most likely leads to the C&C Communication state. However, in our current dataset we see far fewer occurrences of the Binary Download and Exploit states than of the Attack and C&C Communication states. Clearly this dataset is insufficient to investigate the temporal advantage of predicting these states.

Comparison with BotHunter:
As no current publicly available tool predicts botnet attacks, we are unable to directly compare our results with any other solutions. However, Figures 2 and 3 actually represent the temporal advantage offered by our model compared to BotHunter, i.e. the CDF of the time, in minutes, by which we are ahead of the time that BotHunter first becomes aware of an attack. BotHunter processes alerts from the Snort IDS (the same as our system) and correlates them to identify botnet infections in the network. The earliest time that an attack can become known in a network monitored by BotHunter is when a Snort alert is issued after seeing an attack. Our model, on the other hand, attempts to predict the attack based on past behaviour. Thus, our system will either be unable to predict attacks and see them for the first time when a Snort alert is generated (the same as BotHunter), or predict them earlier than the first Snort alert for them; the time advantage from our system will necessarily be ≥ 0, and we can never be late in becoming aware of an attack compared to BotHunter.

TABLE 8: Accuracy analysis of predicting the Attack and C&C Communication states.
State | True Positives | False Positives | True Negatives | False Negatives | Accuracy
Attack | 98.3% | 1.3% | 98.7% | 1.7% | 98.5%
C&C Communication | 99.8% | 1.7% | 98.3% | 0.02% | 99%

Fig. 2: CDF of minutes between an attack prediction and the attack itself.
Fig. 3: CDF of minutes between a C&C Communication prediction and an attack.
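For reference, the accuracy figures in Table 8 follow the standard confusion-matrix definitions; a minimal sketch with hypothetical prediction outcomes:

```python
def rates(pairs):
    """Compute TPR, FPR and overall accuracy from (predicted, actual)
    boolean pairs, where True means 'Attack'."""
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    fn = sum(not p and a for p, a in pairs)
    tpr = tp / (tp + fn)          # true positive rate
    fpr = fp / (fp + tn)          # false alarm rate
    acc = (tp + tn) / len(pairs)  # overall accuracy
    return tpr, fpr, acc

# Hypothetical outcomes: 2 correct alarms, 1 false alarm, 2 correct non-alarms.
pairs = [(True, True), (True, False), (False, False), (False, False), (True, True)]
print(rates(pairs))  # (1.0, 0.3333333333333333, 0.8)
```

The same computation, applied per state, yields each row of Table 8.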
We now extend our single-order Markov chain to higher orders and compare the difference in accuracy. Fig. 4 shows the true positive rate (TPR) and false positive rate (FPR) of attack prediction using Markov chains of order 1 to order 9. We observe that from order 1 to order 2, the TPR increases from 98.3% to 99.2%, and for each subsequent higher order, the TPR continues to increase, approaching nearly a hundred percent at order 9 (no further improvement in accuracy was observed for higher orders). However, the improvement in TPR comes at the cost of a continual increase in FPR, and as Fig. 5 shows, the highest overall accuracy of 98.6% is achieved by the order-1 Markov chain. Although the decrease in accuracy for higher orders is very slight, and fluctuates rather than following an exactly linear trend, our key observation is that the accuracy of the order-1 chain is not matched by any higher order.

From this investigation, we conclude that the higher-order chain approach does not add a significant benefit for the botnet attack prediction problem in practice. In fact, our results seem to corroborate the conclusion of earlier studies which suggest that a first-order Markov chain has comparable accuracy to higher-order chains in the intrusion detection domain [28]. However, because we do observe a clear improvement in TPR that grows with the order of the chain, we argue that some settings can benefit from higher-order chains. For example, in safety-critical networks where attacks are to be strictly prevented, full detection may be desired even at the cost of increased false alarms. It is also important to consider the run-time performance of a high-order chain compared to a single-order chain. As the size of a Markov chain grows as a function of the order of the chain (in general, an m-th-order Markov chain with N states has (N − 1)N^m parameters), the computational cost of training and predicting with a higher-order chain grows exponentially as the order increases.
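The exponential growth of the parameter count is easy to see concretely; for our 4-state model:

```python
def markov_parameters(n_states, order):
    """Free parameters of an order-m chain with N states:
    N**m rows, each a distribution over N states (N - 1 free entries)."""
    return (n_states - 1) * n_states ** order

# Growth for a 4-state model at orders 1, 2 and 9.
print([markov_parameters(4, m) for m in (1, 2, 9)])  # [12, 48, 786432]
```

Going from order 1 to order 9 thus multiplies the model size by a factor of 4^8 = 65536, which is consistent with the much longer training time reported for the order-9 chain in Section 8.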
We discuss the run-time performance of higher-order chains on our dataset in Section 8.

We now show the results of predicting the holding time using our semi-Markov chain in Figs. 6 and 7. Each figure shows a CDF of the prediction error in terms of the number of intervals that our time predictions were off. Fig. 6 plots the average error in predicting holding times across all states, and the errors for each state individually. Table 9 shows the frequency distribution corresponding to Fig. 6, for a closer examination of the holding-time prediction errors for each state. The best possible achievable result is that a hundred percent of the predictions have zero error, i.e. the next state actually occurred within the time interval that we predicted. Our results lie close to this theoretical maximum. The plot for the Binary Download state in Fig. 6 shows 100% correct predictions of holding time, and that of the Attack state shows 99% correct predictions (as the cumulative frequency up to the error value −1 is 0.01, or only 1%). The proportions of correct predictions for holding times of the C&C Communication and Exploit states follow closely, at 96% and 94% respectively. On average, we make 97% accurate holding-time predictions. Table 9 shows the number of observations of each state for which holding times were incorrectly predicted: for all states, the error value −1 has the maximum frequency, i.e. most of our incorrect predictions were only 1 interval early compared to the actual occurrence time of the transition. However, even this small prediction error occurred in a very small proportion of instances.

Fig. 7 shows the holding-time prediction errors only for those states that precede the Attack state. This is important because predicting the holding time accurately matters most when the next transition is to the Attack state, as it tells us how much time we have to contain the upcoming attacks.
The figure shows that the cumulative frequenciesof the error values − and − are zero; thus, the worst-case error is intervals. The cumulative frequency of theerror value − is 0.03, indicating a total error of 3%, i.e.97% accuracy in predicting the holding time of the states TABLE 9: Frequency distribution of error in holding time prediction for each state.
Exploit Binary Download C&C Communication Attack
Error Frequency Error Frequency Error Frequency Error Frequency − − − − − − − − − − − − − − − − − − − − − − − − − − − − T r u e P o s iti v e R a t e Order 1Order 2Order 3Order 4Order 5Order 6Order 7Order 8Order 9
Fig. 4: TPR and FPR of prediction with higher-order models. A cc u r ac y ( % ) Fig. 5: Accuracy of prediction with higher-ordermodels. preceding attack, or in other words, predicting the time ofoccurrence of the Attack state. Across all our results, weobserve that the prediction error never has a positive value,which would indicate that transitions occurred earlier thanwe predicted them to occur; that is, in the small proportionof instances where our prediction is incorrect, we are pre-dicting transitions early rather than late. This implies thatwhenever the next state is predicted to be an attack, the timeprediction may lead to unnecessary preventative measuresbut will never lead to a delayed response that allows attacksto occur before the preventative measures are taken.Although the time estimation for our dataset is veryaccurate, a limitation in our current dataset or IDS limitsthe usefulness of this time prediction capability. Section 6showed that most attack and C&C communication events inthe training data occur within the first interval (i.e., withina minute of previous events). This is reflected in the learned F ij ( t ) distribution, which is used to make holding timepredictions. As a consequence, the semi-Markov chain isusually predicting the first interval ( < t ≤ ) as the timewhen the next transition will occur, which is insufficienttime for manual investigation and in practice would lead toan immediate quarantine response to attack threat. Thus, theadvantage of flexibly deciding whether to take a manual orauto-blocking response is lost. However, we argue that thisis an artefact of the data or IDS used in our experiments,where attack events closely follow C&C communicationevents. If for example the IDS was generating C&C com-munication alerts only once at the beginning of an instanceof C&C communication instead of repeatedly, then attackswould likely be spaced farther apart from the last instanceof C&C communication. 
Similarly, some bots could deliberately inject delays before generating attacks in response to C&C instructions, in order to remain stealthy. Thus, with a different dataset or IDS, the time distribution could vary considerably, and the prediction would not always be the minimum interval, making it important to learn it from data when predicting the times of attacks. Therefore, despite this limitation, we believe that the semi-Markov chain module should still be implemented within a real deployment of our approach. Even if real traffic is similar to our dataset, the semi-Markov chain does not add any significant computational overhead at run-time (as shown in Section 8) and does not affect the accuracy of state prediction.

PRACTICAL ISSUES
Feasibility:
We believe our approach is feasible to deploy in practice because it requires no prior knowledge of infection of monitored hosts; any standard IDS can be deployed over live traffic and the Markov model (previously trained on labelled IDS output) can run over the IDS event stream. Because of the simplicity of the approach, it is easy to re-train the model or update it with new states. Re-training on a new dataset can be performed offline, and the previously trained model can simply be replaced without any changes in the IDS or prediction models. Updating the model with new states (for example, adding more types of attack states) is also simple; the IDS would be updated or replaced (for example, new rules can be added to Snort to allow detecting the new states), and the training and prediction modules would only require minor offline re-programming to allow for a changed number of states.
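The simplicity argued above can be made concrete with a minimal sketch of offline training and live next-state prediction for an order-1 chain. The state labels and training sequences below are hypothetical stand-ins for labelled IDS output; this is an illustration of the idea, not our deployed code.

```python
from collections import defaultdict

def train(sequences):
    """Estimate order-1 transition probabilities from labelled
    per-host IDS event sequences (maximum-likelihood counts)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    model = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        model[cur] = {s: c / total for s, c in nxts.items()}
    return model

def predict_next(model, current_state):
    """Return the most likely next state given the latest IDS event,
    or None if the state was never seen in training."""
    probs = model.get(current_state, {})
    return max(probs, key=probs.get) if probs else None

# Hypothetical labelled sequences (one per monitored host).
train_seqs = [
    ["Exploit", "BinaryDownload", "CnC", "Attack"],
    ["Exploit", "CnC", "Attack"],
    ["BinaryDownload", "CnC", "CnC", "Attack"],
]
model = train(train_seqs)
print(predict_next(model, "CnC"))  # -> Attack
```

Swapping in a re-trained model amounts to replacing the `model` dictionary, which is why re-training requires no changes to the IDS or the prediction logic.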
Run-time Performance:
Markov chains' training time depends on the size of the training data and the order of the chain, as the number of transition probabilities to be learned grows as a function of the order of the chain and the number of states. To analyse training time on our dataset, we have trained an order-1 chain, an order-9 chain, and a semi-Markov chain on our complete dataset. Table 10 shows that the training phase for the order-1 chain and the semi-Markov chain can be completed very quickly (in 21 and 246 milliseconds respectively) because of the simplicity of the 4-state model, while higher-order chains take longer to train (about 3.9 seconds for our order-9 chain). A single prediction with all three model variations is fast enough (under 3 µs, as Table 10 shows) that they can operate live even in high-speed networks.

TABLE 10: Analysis of runtime performance of single-order, higher-order and semi-Markov chains.

Model                  Training Time (ms)   Per-prediction Time (µs)
Order-1 Markov Chain   21                   0.15
Order-9 Markov Chain   3861                 0.19
Semi-Markov Chain      246                  2.55

Fig. 6: CDF of time prediction error for all four states.

Fig. 7: CDF of time prediction error for states preceding attacks.

Robustness to Deliberate Evasion:
An attacker may introduce random delays in compromised hosts' communication or vary the order of the malicious actions, for example by inserting a binary download between the C&C communication and attack phases of a bot binary. While random delays will not affect the Markov chain's attack prediction, which is time-independent, they will affect the semi-Markov chain's time estimates. Varying the order of the malicious actions will clearly also degrade the prediction accuracy of the model, as the live traffic will no longer match the traffic the model was trained on. While designing defences is beyond our current scope, one option is to design a feedback loop into the system, so that live traffic streaming into the system is fed back into the training module after a fixed short time period, for example every hour, once it has been labelled by the IDS, updating the model to reflect the latest malicious behaviour.
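The feedback loop suggested above could be sketched as follows. This is a hypothetical skeleton, assuming IDS-labelled event sequences arrive per host; the class name, buffer policy, and hourly period are illustrative choices rather than a specification of our system.

```python
import time
from collections import defaultdict

RETRAIN_PERIOD_S = 3600  # retrain hourly, as suggested in the text

class FeedbackLoop:
    """Sketch of the proposed feedback loop: IDS-labelled live traffic
    is buffered and periodically folded back into the transition model."""

    def __init__(self):
        self.buffer = []          # labelled sequences since last retrain
        self.last_retrain = time.time()
        self.model = {}

    def on_labelled_sequence(self, seq):
        """Receive one labelled event sequence from the IDS and retrain
        once the fixed period has elapsed."""
        self.buffer.append(seq)
        if time.time() - self.last_retrain >= RETRAIN_PERIOD_S:
            self.retrain()

    def retrain(self):
        """Re-estimate order-1 transition probabilities from the buffered
        sequences, then clear the buffer."""
        counts = defaultdict(lambda: defaultdict(int))
        for seq in self.buffer:
            for cur, nxt in zip(seq, seq[1:]):
                counts[cur][nxt] += 1
        self.model = {
            cur: {s: n / sum(nxts.values()) for s, n in nxts.items()}
            for cur, nxts in counts.items()
        }
        self.buffer.clear()
        self.last_retrain = time.time()
```

Because retraining happens offline on the buffered batch, the live prediction path is untouched: the updated `model` dictionary simply replaces the old one at each period.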
CONCLUSION AND FUTURE WORK
We have presented an approach for predicting the future behaviour of botnet infections, hypothesising that botnet attacks can be predicted using the malicious behaviour that an infected host will engage in prior to launching an attack. We have designed a single-order Markov chain, a set of higher-order Markov chains, and a semi-Markov chain to capture this behavioural sequence and instantiated the models using real-world botnet traffic. All models showed high prediction accuracy, with over 98% of attacks predicted accurately at a low worst-case FPR by the single-order Markov chain, and 97% accurate predictions of the times of future attacks by the semi-Markov chain; the latter is a novel capability that in practice can allow building a time-sensitive threat response system in which the response to attack warnings can be tailored according to the amount of time available. We also showed that the computationally simplest single-order chain achieves higher overall prediction accuracy compared to higher orders.

We acknowledge some limitations in our current work. As acknowledged in Section 4.2, the range of malicious behaviour that we could cover in our evaluation was limited by currently available intrusion detection tools. While our work has served to validate the proposed approach, replicating our experiments with the complete state model of Fig. 1 remains for future work. Secondly, the type of attack currently has no bearing on the prediction; both spam and port scan are mapped to the same state in the model. With a dataset including more attack types, we would like to map different attacks to different states and investigate whether each attack is preceded by unique behavioural patterns. Finally, an investigation of adversarial attacks possible against the system and counter-measures is also left as future work.
Zainab Abaid received her Ph.D. from The University of New South Wales, Australia, and is currently engaged in a research role with CSIRO Data61, Australia. Her research interests include malware detection and adversarial machine learning.

Dilip Sarkar (SM'96) received his Ph.D. from the University of Central Florida. He is an Associate Professor of Computer Science at the University of Miami, Coral Gables. His research interests include VBR video traffic modeling, middleware and Web computing, and parallel and distributed processing. He is a senior member of the IEEE, and a member of the IEEE Computer Society and the Association for Computing Machinery.

Mohamed Ali Kaafar received his Ph.D. from INRIA Sophia Antipolis and is a Full Professor at the Faculty of Science and Engineering at Macquarie University, the scientific Director of the Optus-Macquarie University Cyber Security Hub, and the group leader of the Information Security and Privacy research group at CSIRO Data61. He is the associate editor of ACM Transactions on Modeling and Performance Evaluation of Computing Systems (ACM ToMPECS), and his research interests include Privacy Preserving Technologies, Network Security, malware detection and Applied Cryptography.