Active Deception using Factored Interactive POMDPs to Recognize Cyber Attacker's Intent
Aditya Shinde
Institute for AI, University of Georgia, Athens, GA 30602
[email protected]

Prashant Doshi
Institute for AI & Dept. of Computer Science, University of Georgia, Athens, GA 30602
[email protected]

Omid Setayeshfar
Dept. of Computer Science, University of Georgia, Athens, GA 30602
[email protected]
Abstract
This paper presents an intelligent and adaptive agent that employs deception to recognize a cyber adversary's intent. Unlike previous approaches to cyber deception, which mainly focus on delaying or confusing the attackers, we focus on engaging with them to learn their intent. We model cyber deception as a sequential decision-making problem in a two-agent context. We introduce factored finitely-nested interactive POMDPs (I-POMDP_X) and use this framework to model the problem with multiple attacker types. Our approach models cyber attacks on a single honeypot host across multiple phases, from the attacker's initial entry to reaching its adversarial objective. The defending I-POMDP_X-based agent uses decoys to engage with the attacker at multiple phases to form increasingly accurate predictions of the attacker's behavior and intent. The use of I-POMDPs also enables us to model the adversary's mental state and investigate how deception affects their beliefs. Our experiments in both simulation and on a real host show that the I-POMDP_X-based agent performs significantly better at intent recognition than commonly used deception strategies on honeypots.

1 Introduction

An important augmentation of conventional cyber defense utilizes deception-based cyber defense strategies [16]. These are typically based on the use of decoy systems called honeypots [21] with additional monitoring capabilities. Currently, honeypots tend to be passive systems with the purpose of consuming the attacker's CPU cycles and time, and possibly logging the attacker's actions. However, the information inferred about the attackers' precise intent and capability is usually minimal. On the other hand, honeypots equipped with fine-grained logging abilities offer an opportunity to better understand attackers' intent and capabilities. We may achieve this by engaging and manipulating the attacker to perform actions that reveal his or her true intent. One way of accomplishing this is to employ active deception.
Active strategies entail adaptive deception that seeks to influence the attackers' beliefs and manipulate them into performing desired actions [11]. We investigate how multi-agent decision making can be used toward automating adaptive deception strategies to better understand the attacker.
Preprint. Under review.

We represent cyber deception on a single host as a decision-making problem between a defender and an attacker. We introduce a factored variant of the well-known interactive partially observable Markov decision process [9], labeled I-POMDP_X, to computationally model the decision making of the defender while reasoning about the attacker's beliefs and capabilities as it acts and observes. I-POMDP_X exploits the factored structure of the problem, representing the dynamics and observation function using algebraic decision diagrams, and solving the model using a method that directly operates on these factored representations [1]. This brings some level of tractability to an otherwise intractable framework, sufficient to adequately solve the cyber deception domain. I-POMDP_X explicitly models the beliefs of the attacker and the defender throughout the interaction. This allows for detailed inferences about how specific deceptive actions affect the attacker's subjective view of the system. We evaluate the performance of I-POMDP_X in promoting active deception with multiple attacker types, both in simulation and on a real host. Our results show that the I-POMDP-based agent learns the intent of the attacker much more accurately than baselines that do not engage the attacker or immediately deploy all decoys en masse.

2 Background: Interactive POMDPs

Interactive POMDPs (I-POMDPs) are a generalization of POMDPs to sequential decision making in multi-agent environments [9, 4]. Formally, an I-POMDP for agent i in an environment with one other agent j is defined as

I-POMDP_i = ⟨IS_i, A, T_i, Ω_i, O_i, R_i⟩.

IS_i denotes the interactive state space. This includes the physical state S as well as models of the other agent M_j, which may be intentional or subintentional [3]. In this paper, we ascribe intentional models to the other agent as they model the other agent's beliefs and capabilities as a rational agent.
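The shape of this tuple can be mirrored directly in code. The following is a purely illustrative skeleton: the class and the toy instance below are our own naming assumptions, not the paper's implementation.

```python
# Skeleton of the I-POMDP tuple ⟨IS_i, A, T_i, Ω_i, O_i, R_i⟩ for agent i.
# Illustrative sketch only; names and the toy instance are assumptions.
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet, Tuple

State = Any                      # physical state s ∈ S
Model = Tuple[Any, str]          # intentional model m_j = (belief b_j, frame)

@dataclass(frozen=True)
class IPOMDP:
    states: FrozenSet[State]     # S
    models_j: FrozenSet[Model]   # M_j, kept finite at each nesting level
    actions_i: FrozenSet[str]    # A_i
    actions_j: FrozenSet[str]    # A_j; joint actions A = A_i × A_j
    observations_i: FrozenSet[str]                # Ω_i
    T: Callable[[State, str, str, State], float]  # T_i(s, a_i, a_j, s')
    O: Callable[[State, str, str, str], float]    # O_i(s', a_i, a_j, o_i)
    R: Callable[[State, str, str], float]         # R_i(s, a_i, a_j)

# A degenerate two-state instance, only to show the shape of the tuple.
toy = IPOMDP(
    states=frozenset({0, 1}),
    models_j=frozenset(),
    actions_i=frozenset({"deploy_decoy", "nop"}),
    actions_j=frozenset({"recon", "exit"}),
    observations_i=frozenset({"decoy_touch", "quiet"}),
    T=lambda s, ai, aj, s2: 0.5,
    O=lambda s2, ai, aj, o: 0.5,
    R=lambda s, ai, aj: 1.0 if ai == "deploy_decoy" else 0.0,
)
```

Note that T, O, and R take both agents' actions, reflecting that the environment dynamics depend on the joint action.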
A = A_i × A_j is the set of joint actions of both agents. T_i represents the transition function, T_i : S × A × S → [0, 1]. The transition function is defined over the physical states and excludes the other agent's models. This is a consequence of the model non-manipulability assumption: an agent's actions do not directly influence the other agent's models. Ω_i is the set of agent i's observations. O_i is the observation function, O_i : S × A × Ω_i → [0, 1]. The observation function is defined over the physical state space only, as a consequence of the model non-observability assumption: the other agent's model parameters may not be observed directly. R_i defines the reward function for agent i, R_i : IS_i × A → ℝ. The reward function in I-POMDPs usually assigns utilities based on the physical states.

We limit our attention to a finitely nested I-POMDP, in which the interactive state space IS_{i,l} at strategy level l is defined bottom up as:

IS_{i,0} = S,            Θ_{j,0} = {⟨b_{j,0}, θ̂_j⟩ : b_{j,0} ∈ Δ(IS_{j,0})}
IS_{i,1} = S × M_{j,0},  Θ_{j,1} = {⟨b_{j,1}, θ̂_j⟩ : b_{j,1} ∈ Δ(IS_{j,1})}
...
IS_{i,l} = S × M_{j,l−1},  Θ_{j,l} = {⟨b_{j,l}, θ̂_j⟩ : b_{j,l} ∈ Δ(IS_{j,l})}.

Above, θ̂_j represents agent j's frame, defined as θ̂_j = ⟨A_j, Ω_j, T_j, O_j, R_j, OC_j⟩, where OC_j represents j's optimality criterion and the other terms are as defined previously. Θ_j is the set of agent j's intentional models, each defined as θ_j = ⟨b_j, θ̂_j⟩. The interactive state space is typically restricted to a finite set of j's models, which are updated after every interaction to account for the belief update of agent j. The interactive state space for agent i at level l can then be defined as:

IS_{i,l} = S × Reach(Θ_{j,l−1}, H),  Θ_{j,l} = {⟨b_{j,l}, θ̂_j⟩ : b_{j,l} ∈ Δ(IS_{j,l})}.

Here, Reach(Θ_{j,l−1}, H) is the set of level l−1 models that j could have in H steps, with Reach(Θ_{j,l−1}, 0) = Θ_{j,l−1}. We obtain Reach(·) by repeatedly updating j's beliefs in the models in Θ_{j,l−1}.

3 Modeling Active Cyber Deception

Engaging and deceiving human attackers into intruding controlled systems and accessing obfuscated data offers a proactive approach to computer and information security. It wastes attacker resources and potentially misleads the attacker. Importantly, it offers an untapped opportunity to understand attackers' beliefs, capabilities, and preferences, and how they evolve, by sifting the detailed activity logs. Identifying these mental and physical states not only informs the defender about the attacker's intent, but also guides new ways of deceiving the attacker. In this section, we first introduce our domain of cyber deception and subsequently discuss how it can be modeled in a factored I-POMDP.
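Before turning to the domain, the Reach operation from the background section can be made concrete: it amounts to repeatedly applying the POMDP belief update for every action–observation pair. The sketch below uses an invented two-state, two-action POMDP, not the paper's model; all numbers are illustrative assumptions.

```python
# Sketch of Reach(Theta_j, H): collect all beliefs j could hold after up to
# H belief updates. Toy two-state POMDP with illustrative probabilities.
import itertools

STATES = [0, 1]
ACTIONS = ["recon", "exit"]
OBS = ["hit", "miss"]

# Toy transition T[s][a] -> distribution over s', and observation
# O[s'][a][o] probabilities.
T = {0: {"recon": [0.9, 0.1], "exit": [0.5, 0.5]},
     1: {"recon": [0.2, 0.8], "exit": [0.5, 0.5]}}
O = {0: {"recon": {"hit": 0.3, "miss": 0.7}, "exit": {"hit": 0.5, "miss": 0.5}},
     1: {"recon": {"hit": 0.8, "miss": 0.2}, "exit": {"hit": 0.5, "miss": 0.5}}}

def belief_update(b, a, o):
    """POMDP Bayes filter: b'(s') ∝ O(s', a, o) · Σ_s T(s, a, s') b(s)."""
    unnorm = [O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in STATES)
              for s2 in STATES]
    z = sum(unnorm)
    if z == 0.0:
        return None  # observation impossible under this belief
    return tuple(round(p / z, 10) for p in unnorm)

def reach(initial_beliefs, horizon):
    """Reach(Theta, 0) = Theta; each step adds all one-step belief updates."""
    frontier, seen = set(initial_beliefs), set(initial_beliefs)
    for _ in range(horizon):
        nxt = set()
        for b in frontier:
            for a, o in itertools.product(ACTIONS, OBS):
                b2 = belief_update(b, a, o)
                if b2 is not None:
                    nxt.add(b2)
        seen |= nxt
        frontier = nxt
    return seen

models = reach({(0.5, 0.5)}, horizon=2)
```

In the full framework each element of Reach is a model (belief plus frame); here frames are elided for brevity.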
3.1 The Cyber Deception Domain

The cyber deception domain models the interaction between the attacker and the defender on a single honeypot host. A state of the interaction is modeled using 11 state variables, defining a total of 4,608 states. Table 1 briefly summarizes the state space. The S_DATA_DECOYS and C_DATA_DECOYS state variables represent the presence of sensitive data decoys and critical data decoys, respectively. The HOST_HAS_DATA variable represents the true type of valuable data on the system. We assume that a system cannot have two different types of valuable data simultaneously. This is a reasonable assumption because different hosts on enterprise networks usually possess different assets. We differentiate between sensitive_data and critical_data as distinct targets. Sensitive data includes, for example, private data of employees or high-ranking officials, or any data that the attacker would profit from stealing. Also, in practical scenarios, honeypots never contain any real valuable data. Consequently, in the cyber deception domain in this paper, HOST_HAS_DATA is none. However, the attacker is unaware of the honeypot or the data decoys and hence forms a belief over this state variable. Thus, the HOST_HAS_DATA variable gives a subjective view of the attacker being deceived.

Table 1: The state of the cyber deception domain is comprised of 11 variables.
State Variable Name   Values                               Description
PRIVS_DECEPTION       user, root, none                     Deceptive reporting of privileges
S_DATA_DECOYS         yes, no                              Presence of sensitive data decoys
C_DATA_DECOYS         yes, no                              Presence of critical data decoys
HOST_HAS_DATA         sensitive_data, critical_data, none  Type of valuable data on the system
DATA_ACCESS_PRIVS     user, root                           Privileges required to access or find data
ATTACKER_PRIVS        user, root                           Attacker's highest privileges
DATA_FOUND            yes, no                              Valuable data found by the attacker
VULN_FOUND            yes, no                              Local PrivEsc discovered by attacker
IMPACT_CAUSED         yes, no                              Attack successful
ATTACKER_STATUS       active, inactive                     Presence of attacker on the host
HOST_HAS_VULN         yes, no                              Presence of local PrivEsc vulnerability

There are 5 observation variables for the attacker, which make up a total of 48 unique observations. We include three different types of attackers: the data exfil attacker, the data manipulator, and the persistent threat. The data exfil attacker represents a threat that aims to steal valuable private data from the host. The data manipulator represents a threat that seeks to manipulate data that is critical to the operation of a business or a physical target. Thus, the data exfil attacker targets sensitive_data on the system and the data manipulator targets critical_data. The persistent threat attacker wants to establish a strong presence in the system at a high privilege level.

The attacker can perform one of 9 actions to gather information about the system, manipulate the system, or take action on objectives. Table 2 briefly summarizes the actions available to the attacker.
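As a sanity check on the state-space size quoted above, the product of the domain sizes of the 11 variables in Table 1 can be enumerated directly. A short sketch (the dictionary mirrors Table 1; the code itself is our illustration, not the paper's implementation):

```python
# Sanity check: the 11 state variables of Table 1 define 4,608 joint states.
from itertools import product
from math import prod

DOMAINS = {
    "PRIVS_DECEPTION":   ["user", "root", "none"],
    "S_DATA_DECOYS":     ["yes", "no"],
    "C_DATA_DECOYS":     ["yes", "no"],
    "HOST_HAS_DATA":     ["sensitive_data", "critical_data", "none"],
    "DATA_ACCESS_PRIVS": ["user", "root"],
    "ATTACKER_PRIVS":    ["user", "root"],
    "DATA_FOUND":        ["yes", "no"],
    "VULN_FOUND":        ["yes", "no"],
    "IMPACT_CAUSED":     ["yes", "no"],
    "ATTACKER_STATUS":   ["active", "inactive"],
    "HOST_HAS_VULN":     ["yes", "no"],
}

n_states = prod(len(v) for v in DOMAINS.values())  # 3 * 3 * 2**9 = 4608

# Joint states can be enumerated lazily, e.g. when building tabular factors.
states = product(*DOMAINS.values())
first = next(states)  # one 11-tuple assignment to all variables
```

This is exactly the blow-up that motivates the factored representation of Section 3.2: the flat space has 4,608 states, while each factor's conditional distribution remains small.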
The FILE_RECON_SDATA and FILE_RECON_CDATA actions cause the DATA_FOUND variable to transition to yes. The FILE_RECON_SDATA action is slightly worse at finding data than FILE_RECON_CDATA. This reflects the fact that private sensitive information is slightly more difficult to find because it is often stored in user directories in arbitrary locations. Critical data, on the other hand, such as service configuration or database files, is stored in well-known locations on the system. The attacker gets information about the DATA_FOUND transition through the DATA observation variable. This simulates the data discovery phase of an attack. VULN_RECON is another action that works similarly and causes VULN_FOUND to transition to yes. This transition depicts the attacker looking for vulnerabilities to raise privileges. Depending on the type of the attacker, the START_EXFIL, MANIPULATE_DATA, or PERSIST actions can be performed to achieve the attacker's main objectives. We assume that the attacker is unable to discern between decoy data and real data, and hence unable to determine which variable influences the DATA_FOUND state transition during file discovery.

Table 2: The actions available to the attacker.

Action name        States affected    Description
FILE_RECON_SDATA   DATA_FOUND         Search for sensitive data for theft
FILE_RECON_CDATA   DATA_FOUND         Search for critical data for manipulation
VULN_RECON         VULN_FOUND         Search for local PrivEsc vulnerability
PRIV_ESC           ATTACKER_PRIVS     Exploit local PrivEsc vulnerability
CHECK_ROOT         none               Check availability of root privileges
START_EXFIL        IMPACT_CAUSED      Upload critical data over network
PERSIST            IMPACT_CAUSED      Establish a permanent presence in the system
MANIPULATE_DATA    IMPACT_CAUSED      Manipulate stored data
EXIT               ATTACKER_STATUS    Terminate the attack

The attacker, however, can distinguish between different types of valuable data. So, if the system contains data that is different from what the attacker expects, the attacker can observe this from the
DISCREPANCY observation variable. As DATA and DISCREPANCY are separate observation variables, the attacker can observe a discrepancy even when data has been found. When this occurs, the attacker develops a belief over the decoy data states, as the host can have only one type of data. This realistically models a situation in which the attacker encounters multiple decoys of different types and suspects deception.

The defender in the interaction starts with complete information about the system. The defender's actions mostly govern the deployment and removal of different types of decoys. These actions influence the S_DATA_DECOYS and C_DATA_DECOYS states. Additionally, the defender can influence the attacker's observations about his privileges through the PRIVS_DECEPTION state. The defender gets perfect observations whenever the attacker interacts with a decoy. Additionally, the defender gets stochastic observations about the attacker's actions through the LOG_INFERENCE observation variable. The attacker is rewarded for exiting the system after causing an impact. For the data exfil and data manipulator attacker types, this is achieved by performing the START_EXFIL and MANIPULATE_DATA actions, respectively. The persistent threat attacker is rewarded for gaining root-level persistence in the system.

Figure 1: The attacker starts with a low prior belief on the existence of decoys and an active defender. If decoys are indistinguishable from real data, the attacker attributes his observation to the existence of real data even when the host has none.

Figure 1 illustrates a scenario taken from an actual simulation run with the data manipulator attacker type. Initially, the attacker has a non-zero belief over the existence of data on the system. However, the true state of the system, on the left, shows that the system does not actually contain any data. In the absence of the defender or any static data decoys, the attacker would eventually update his beliefs to accurately reflect reality by performing the FILE_RECON_CDATA action and observing the result. To avoid this belief state, the defender deploys data decoys when the attacker acts. The attacker's inability to tell the difference between decoy data and real data, and his prior belief about
FILE_RECON_CDATA action and observing the result.However, to avoid this belief state, the defender deploys data decoys when the attacker acts. Theattacker’s inability to tell the difference between decoy data and real data and his prior belief about4he absence of decoys leads him to attribute his observations to the existence of real data leading tothe attacker being deceived. (a) Dynamics compactly represented as a two time-sliceDBN for select joint actions and observation variables. (b) An ADD representing the observation function P NOP ( S_DATA_DECOY_INTR’ |X (cid:48) , A j ) Figure 2: I-POMDP X representation of the cyber deception domain. Factored POMDPs have been effective toward solving structured problems with large state andobservation spaces [7, 17]. Motivated by this observation, we extend the finitely-nested I-POMDPreviewed in Section 2 to its factored representation, I-POMDP X . Formally, this extension is definedas: I-POMDP X = (cid:104)IS i , A, T i , Y i , O i , R i (cid:105)IS i is the factored interactive state space consisting of physical state factors X and agent j ’s models M j . In a finitely-nested I-POMDP X the set M j is bounded similarly to finitely-nested I-POMDPs.Action set A is defined exactly as before. We use algebraic decision diagrams (ADDs) [1] to representthe factors for agent i ’s transition, observation, and reward functions compactly. T i defines thetransition function represented using ADDs as P a i ,a j ( X (cid:48) |X ) for a i ∈ A i and a j ∈ A j . Y i is theset of observation variables which make up the observation space. O i is the observation functionrepresented as ADDs, P a i ,a j ( Y (cid:48) i |X (cid:48) ) . R i defines the reward function for agent i . 
The reward function is also represented as an ADD, R^{a_i,a_j}(X).

We illustrate I-POMDP_X by modeling the cyber deception domain of Section 3.1 in the framework. Figure 2a shows the DBN for select state and observation variables, given that the attacker engages in reconnaissance actions. The two slices in the DBN represent the sets of pre- and post-action state variables, X = {X_1, ..., X_n} and X' = {X'_1, ..., X'_n}, where X_n represents a single state variable. Similarly, Y'_i = {Y'_{i,1}, ..., Y'_{i,n}} and Y'_j = {Y'_{j,1}, ..., Y'_{j,n}} denote the sets of observation variables for agents i and j, respectively. The ADD

P^{a_i}(X' | X, A_j) = P^{a_i}(X'_1 | X, A_j) × ... × P^{a_i}(X'_n | X, A_j)

represents the complete transition function for action A_i = a_i. This is analogous to the complete action diagram defined by Hoey et al. [13] for MDPs. Similarly, the observation function is represented using the ADD (Fig. 2b)

P^{a_i}(Y'_i | X', A_j) = P^{a_i}(Y'_{i,1} | X', A_j) × ... × P^{a_i}(Y'_{i,n} | X', A_j),

which is analogous to the complete observation diagram [7]. Additionally, in an I-POMDP_X, agent i also recursively updates the beliefs of agent j. The attacker types are modeled as frames in M_j. Let M_j = {m_{j,1} = ⟨b_{j,1}, θ̂_{j,1}⟩, ..., m_{j,n} = ⟨b_{j,q}, θ̂_{j,r}⟩} be the set of all models in Reach(Θ_{j,l−1}, H). Because neither a_j nor o_j is directly accessible to agent i, they are represented as the ADDs P(A_j | M_j) and P^{a_i}(Y'_j | X', A_j). The distribution over M'_j is then P^{a_i}(M'_j | M_j, Y'_j, A_j, X') = P^{a_i}(M'_j | M_j, A_j, Y'_j) × P^{a_i}(Y'_j | X', A_j).
Using these factors, we can now define the distribution over X' and M'_j, given action a_i and observation o_i, as a single ADD using existential abstraction:

P^{a_i,o_i}(M'_j, X' | M_j, X)
  = Σ_{A_j, Y'_j} P^{a_i,o_i}(Y'_j, M'_j, X', A_j | M_j, X)
  = Σ_{A_j, Y'_j} P^{a_i}(X' | X, A_j) P^{a_i}(Y'_i | X', A_j) P(A_j | M_j) P^{a_i}(M'_j | M_j, A_j, Y'_j, X').   (1)

Here, the ADD P^{a_i}(X' | X, A_j) compactly represents T_i(s^{t−1}, a_i^{t−1}, a_j^{t−1}, s^t), P^{a_i}(Y'_i | X', A_j) represents the probabilities O_i(s^t, a_i^{t−1}, a_j^{t−1}, o_i^t), P(A_j | M_j) represents P(a_j^{t−1} | θ_j^{t−1}), and P^{a_i}(M'_j | M_j, A_j, Y'_j, X') represents the recursive belief-update transitions τ_{θ_j^t}(b_j^{t−1}, a_j^{t−1}, o_j^t, b_j^t) O_j(s^t, a_i^{t−1}, a_j^{t−1}, o_j^t) of the original I-POMDP. Thus, the constructed ADD P^{a_i,o_i}(X', M'_j | X, M_j) contains the transition probabilities for all interactive state variables given action a_i and observation o_i. The I-POMDP_X belief update can then be computed as:

b_i^{a_i,o_i}(X', M'_j) = Σ_{X, M_j} b_i(X, M_j) × P^{a_i,o_i}(X', M'_j | X, M_j),   (2)

where the ADD P^{a_i,o_i}(X', M'_j | X, M_j) is obtained as in Eq. 1.

Symbolic Perseus [17] offers a relatively scalable point-based approximation technique that exploits the ADD structure of factored POMDPs. Toward generalizing this technique to I-POMDP_X, we are aided by the existence of point-based value iteration for I-POMDPs [5].
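The essence of the Eq. 2 update can be sketched in a drastically simplified form: we collapse the physical state and the attacker's own belief update, reduce frames to fixed action distributions P(A_j | m_j), and filter only the defender's belief over frames. All distributions below are illustrative assumptions, not the paper's learned quantities.

```python
# Minimal sketch of the belief update over attacker models, with the physical
# state and the attacker's recursive belief update elided. Frames are reduced
# to fixed policies P(a_j | m_j); all numbers are illustrative assumptions.

FRAMES = ["data_exfil", "data_manipulator", "persistent_threat"]

# P(a_j | m_j): each attacker type's propensity for its reconnaissance action.
policy = {
    "data_exfil":        {"FILE_RECON_SDATA": 0.7, "FILE_RECON_CDATA": 0.1, "VULN_RECON": 0.2},
    "data_manipulator":  {"FILE_RECON_SDATA": 0.1, "FILE_RECON_CDATA": 0.7, "VULN_RECON": 0.2},
    "persistent_threat": {"FILE_RECON_SDATA": 0.1, "FILE_RECON_CDATA": 0.1, "VULN_RECON": 0.8},
}

# P(o_i | a_j): decoy touches are observed perfectly; vulnerability probing
# is seen only through noisy log inference (here collapsed to one symbol).
obs_model = {
    "FILE_RECON_SDATA": {"sdata_decoy_touch": 1.0, "cdata_decoy_touch": 0.0, "log_noise": 0.0},
    "FILE_RECON_CDATA": {"sdata_decoy_touch": 0.0, "cdata_decoy_touch": 1.0, "log_noise": 0.0},
    "VULN_RECON":       {"sdata_decoy_touch": 0.0, "cdata_decoy_touch": 0.0, "log_noise": 1.0},
}

def update_frame_belief(belief, o_i):
    """b'(m_j) ∝ b(m_j) · Σ_{a_j} P(o_i | a_j) P(a_j | m_j)."""
    unnorm = {m: belief[m] * sum(obs_model[a][o_i] * policy[m][a]
                                 for a in policy[m])
              for m in FRAMES}
    z = sum(unnorm.values())
    return {m: p / z for m, p in unnorm.items()}

b = {m: 1.0 / 3 for m in FRAMES}               # uniform prior over frames
b = update_frame_belief(b, "cdata_decoy_touch")  # attacker touches a critical-data decoy
```

One perfectly observed decoy interaction already concentrates most of the belief mass on the data manipulator frame, which mirrors the sharp cross-entropy drop reported in the experiments.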
Subsequently, we may generalize the α-vectors and their backup from the latter to the factored representation of I-POMDP_X:

Γ^{a_i,*} ← α^{a_i,*}(X, M_j) = Σ_{A_j} R^{a_i}(X, A_j) P(A_j | M_j)
Γ^{a_i,o_i} ∪← α^{a_i,o_i}(X, M_j) = γ Σ_{X',M'_j} P^{a_i,o_i}(X', M'_j | X, M_j) α^{t+1}(X', M'_j),  ∀ α^{t+1} ∈ V^{t+1}
Γ^{a_i} ← Γ^{a_i,*} ⊕_{o_i} argmax_{Γ^{a_i,o_i}} (α^{a_i,o_i} · b_i)
V^t ← argmax_{α^t ∈ ∪_{a_i} Γ^{a_i}} (α^t · b_i),  ∀ b_i ∈ B_i.   (3)

Here, V^{t+1} is the set of α-vectors from the next time step and b_i is a belief point from the set of considered beliefs B_i. A popular way of building B_i is to project an initial set of belief points forward for H time steps using the belief update of Eq. 2.

4 Experiments

We modeled the full cyber deception domain described in Section 3.1 from the perspective of a level-1 defender using the I-POMDP_X framework. We implemented the generalized Symbolic Perseus, using the point-based updates of the α-vectors and the belief-set projection given in Section 3.2, to solve I-POMDP_X. The solver has several enhancements, such as cached ADD computations and ADD approximations, for additional speedup.

We evaluate the deception policy generated by I-POMDP_X in simulations and on an actual system consisting of a standalone attacker programmed via Metasploit [15] and a defender workstation. We simulate each attacker type using the optimal policy computed by the level-0 attacker POMDP. We show these policies for each type of attacker in the supplementary material. For the simulations, we randomly sample the frame and the starting privileges of the attacker to simulate a threat with unknown intentions and privileges. The defender begins knowing about the existence of decoys on the system. The attacker, on the other hand, does not have prior knowledge about any vulnerabilities or data on the system.
The defender engages with the attacker by deploying decoys, facilitating deceptive observations, or adding known vulnerabilities to the system. In the simulations, the state transitions and observations for both agents are generated by sampling from the joint transition functions and individual observation functions.

Simulations
We compare the I-POMDP_X policy against two passive baselines: one that does not engage and passively observes the attacker, and another that uses deception indiscriminately, having deployed both sensitive and critical data decoys and all vulnerabilities in the honeypot at the beginning. We label the first baseline NO-OP(no decoy) and the second NO-OP(all decoys). We perform the simulations for 30 trials, with an attacker type randomly picked in each trial. During each trial, the defender begins not knowing the type of the attacker or the attacker's privileges. We set H in Reach(Θ_{j,l−1}, H) to 5. The generalized Symbolic Perseus is then run on 200 projected belief points until convergence to obtain the policy, which prescribes the subsequent actions for the defender until the end of the trial. It converges in about 6 minutes, with a mean time per backup of 37 secs, on Ubuntu 18 with an Intel i7 and 64 GB RAM. The NO-OP(no decoy) and NO-OP(all decoys) baselines yielded shorter mean engagements of around 4.30 steps, while the I-POMDP_X agent engaged with the attacker for a mean duration of 5.90 steps.

(a) When engaging a defender-unaware attacker, the I-POMDP_X-based defender outperforms other passive agents in engaging the attackers and recognizing their intent. (b) On the actual host deployment, the I-POMDP_X-based agent uses implemented deception techniques to engage with the attacker for a longer duration than other agents.

Figure 3: Cross entropy (KL divergence) of the beliefs of the I-POMDP_X agent and other baselines in simulations. Cross entropies near zero signify good intent recognition.

Do the extended engagements facilitated by the I-POMDP_X agent help in intent recognition? Figure 3(a) shows the cross entropy between the defender's belief of the attacker's frame and the attacker's true type, as it varies across the steps of the interaction.
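The metric plotted in Figure 3 is straightforward to compute: with a one-hot target distribution over the true frame, the KL divergence reduces to the negative log of the belief assigned to the true type. A sketch with invented belief trajectories (the numbers are illustrative, not the paper's data):

```python
# Sketch of the intent-recognition metric of Fig. 3: KL divergence between
# the defender's belief over attacker frames and the true type (one-hot).
# The belief trajectory below is invented for illustration.
import math

def cross_entropy(belief, true_frame, eps=1e-12):
    # With a one-hot target t, KL(t || b) = -log b(true frame).
    return -math.log(max(belief[true_frame], eps))

beliefs_over_time = [
    {"data_exfil": 0.33, "data_manipulator": 0.34, "persistent_threat": 0.33},
    {"data_exfil": 0.10, "data_manipulator": 0.75, "persistent_threat": 0.15},
    {"data_exfil": 0.02, "data_manipulator": 0.96, "persistent_threat": 0.02},
]
curve = [cross_entropy(b, "data_manipulator") for b in beliefs_over_time]
# Values near zero signify that the true intent has been recognized.
```

The `eps` floor guards against taking the log of zero when a belief briefly rules out the true frame, which matters under the noisier log-inference observations of the host deployment.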
The defender's I-POMDP_X policy eventually yields the lowest cross-entropy values compared to the baselines, often reaching zero in 6 steps. We show the cross entropy for more steps because the attacker remains in the system performing a few more actions. The sharp decrease in cross entropy in the first three steps occurs because the attacker's decoy interactions (if the attacker is of type data exfil or data manipulator) are perfectly observed by the defender (some other interactions generate noisy observations). Multiple consecutive data reconnaissance steps filter out the persistent threat attacker type, and the final step of either exfiltrating the data or manipulating it allows the defender to distinguish between the remaining two attacker types. For the NO-OP(no decoy) agent with no deception, however, the only source of information about the attacker is his general actions, which are noisy. Hence, such a defender is unable to form accurate beliefs before the attacker leaves the interaction. For the NO-OP(all decoys) agent that indiscriminately uses deception, observations from decoy interactions are perfect, but the risk of the attacker encountering contradicting decoys and suspecting deception is also high, leading to early exits.

Host deployment
In our next phase of experimentation, we evaluated the real-world feasibility of deploying an operational I-POMDP_X agent on a host system and tested its efficacy. The testbed consists of 3 separate hosts: the attacker, the adaptive honeypot, and the defender. Figure 4 shows the overall architecture of our testbed implementation. The attacker system runs a Kali Linux distribution, which is well known for the variety of offensive and defensive cybersecurity tools preinstalled on it.

Figure 4: System architecture of the testbed used to deploy the agents. The defender manipulates the system through decoys and commonly used coreutils binaries to give deviant observations.

The adaptive honeypot on which the interaction takes place runs a Metasploitable 3 Linux distribution. This distribution has a wide range of built-in vulnerabilities and is commonly used to simulate victim workstations in cyber attack simulations. The adaptive honeypot also contains an attacker agent that executes the attacks and communicates with the attacker. The attacker agent implements the actions given by the attacker's optimal plan, located on the attacker host, using realistic techniques commonly used by real attackers. We implement real exploits to facilitate privilege escalation on the host. The adaptive honeypot also has a defender agent that implements the defender's actions and gets observations.

The defender AI, located on the defender workstation, solves the I-POMDP_X and computes the optimal action. To implement the observation function, the I-POMDP_X agent monitors and analyzes the system logs to get information about the attacker's actions (i.e., observations). To enable this, we use GrAALF [19], a graphical framework for processing and querying system call logs. GrAALF analyzes system call logs in real time and provides the stochastic LOG_INFERENCE observation variable values (pertaining to file and vulnerability searches) as well as the perfectly observed
DATA_DECOY_INTERACTION variable values to the defender.

Our results in Fig. 3(b) show that the adaptive deception strategy employed by the I-POMDP_X agent is better at engaging adversaries on a honeypot than the passive strategies that are commonly used. While the cross entropy does not reach zero, due to the challenge of accurately inferring the attacker's actions from the logs (leading to noisier observations), it gets close to zero, which is indicative of accurate intent recognition.

5 Related Work

AI methods are beginning to be explored for use in cyber deception. An area of significant recent interest has been game-theoretic multi-agent modeling of cyber deception, which contrasts with the decision-theoretic modeling adopted in this paper.

Schlenker et al. [18] introduced cyber deception games based on Stackelberg games [20]. These model deception during the network reconnaissance phase, when the attacker is deceived into intruding a honeypot. Another similar approach [6] allocates honeypots in a network using a Stackelberg game. The game uses attack graphs to model the attacker and creates an optimal honeypot allocation strategy to lure attackers. Jajodia et al. [12] develop a probabilistic logic to model deception during network scanning. While these efforts focus on static deployment of deception strategies at the network level, we seek active deception at the host level, once the attacker has entered the honeypot. Further, we model individual phases of the attack in greater detail, which allows us to employ realistic deception techniques at each phase.

At the host level, Carroll et al. [2] model deception as a signaling game, while Horak et al. [10] create a model for active deception using partially observable stochastic games. However, both of these take a high-level view, modeling defender actions rather abstractly. In contrast, our defender actions are realistic and can be implemented on honeypots, as demonstrated in Section 4. Ferguson-Walter et al.
[8] model possible differences between the attacker's and the defender's perceptions of the interaction by modeling cyber deception as a hypergame [14]. Hypergames model different views of the game being played from the perspectives of the players. While this approach, similar to ours, represents the attacker's perspective of the game, we explicitly model the adversary using a subjective decision-theoretic approach and do not solve for equilibrium.
6 Conclusion

Our approach of utilizing automated decision making for deception to recognize attacker intent is a novel application of AI and decision making in cyber security. It elevates extant security methods from anomaly and threat detection to intent recognition. We introduced a factored variant of the well-known I-POMDP framework, which exploits the environment structure, and utilized it to model the new cyber deception domain. Our experiments revealed that the I-POMDP_X-based agent succeeds in engaging various types of attackers for a longer duration than passive honeypot strategies, which facilitates intent recognition. Importantly, the agent is practical on a real system with logging capabilities, paving the way for its deployment in actual honeypots.

Broader Impact

On a broader scale, the I-POMDP_X framework that we introduce makes I-POMDPs tractable enough to be applied to larger problems. I-POMDPs are suitable for modeling multi-agent interactions due to their ability to model opponents from the perspective of an individual. This has a multitude of applications, such as negotiations, studying human behavior, and cognition. Through our work, we hope to make I-POMDPs tractable for such domains. Another area that we hope to motivate through our research is deception in human interactions. Modeling other agents explicitly will help in understanding how deceptive or real information influences an individual's beliefs. This has a wide range of potential applications, such as studying how biases can be exploited, the effect of fake news on individuals, and how individuals can detect deception. We hope our research will eventually motivate further work in areas like counter-deception and deception resilience in agents.

At an application level, our work aims to motivate the use of AI and decision making to create informed cyber defense strategies.
Our work provides a new perspective, different from the traditional action-reaction dynamic that has defined interactions between cyber attackers and defenders for years. Our framework models the opponent's mental states and preferences, which will aid security teams in understanding threats at a deeper level. We hope our framework will motivate the development of adaptive and intelligent deceptive solutions that can study and predict attackers. Understanding attackers' mental models, inherent biases, and preferences will go a long way in forming flexible cyber defense strategies that can adapt to different threats.

References

[1] R. I. Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebric decision diagrams and their applications. Formal Methods in System Design, 10(2-3):171–206, 1997.

[2] T. E. Carroll and D. Grosu. A game theoretic investigation of deception in network security. Security and Communication Networks, 4(10):1162–1172, 2011.

[3] D. Dennett. Intentional systems. Brainstorms, 1986.

[4] P. Doshi. Decision making in complex multiagent settings: A tale of two frameworks. AI Magazine, 33(4):82–95, 2012.

[5] P. Doshi and D. Perez. Generalized point based value iteration for interactive POMDPs. In AAAI, pages 63–68, 2008.

[6] K. Durkota, V. Lisý, B. Bošanský, and C. Kiekintveld. Approximate solutions for attack graph games with imperfect information. In International Conference on Decision and Game Theory for Security, pages 228–249. Springer, 2015.

[7] Z. Feng and E. A. Hansen. Approximate planning for factored POMDPs. In Sixth European Conference on Planning, 2014.

[8] K. Ferguson-Walter, S. Fugate, J. Mauger, and M. Major. Game theory for adaptive defensive cyber deception. In Proceedings of the 6th Annual Symposium on Hot Topics in the Science of Security, page 4. ACM, 2019.

[9] P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

[10] K. Horák, Q. Zhu, and B. Bošanský. Manipulating adversary's belief: A dynamic game approach to deception by design for proactive network security. In International Conference on Decision and Game Theory for Security, pages 273–294. Springer, 2017.

[11] S. Jajodia, V. Subrahmanian, V. Swarup, and C. Wang. Cyber Deception. Springer, 2016.

[12] S. Jajodia, N. Park, F. Pierazzi, A. Pugliese, E. Serra, G. I. Simari, and V. Subrahmanian. A probabilistic logic of cyber deception. IEEE Transactions on Information Forensics and Security, 12(11):2532–2544, 2017.

[13] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 1999.

[14] N. S. Kovach, A. S. Gibson, and G. B. Lamont. Hypergame theory: A model for conflict, misperception, and deception. Game Theory, 2015, 2015.

[15] D. Maynor. Metasploit Toolkit for Penetration Testing, Exploit Development, and Vulnerability Research. Elsevier, 2011.

[16] L. Pingree. Emerging technology analysis: Deception techniques and technologies create security technology business opportunities. TrapX Security, pages 1–18, 2018.

[17] P. Poupart. Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. PhD thesis, University of Toronto, 2005.

[18] A. Schlenker, O. Thakoor, H. Xu, L. Tran-Thanh, F. Fang, P. Vayanos, M. Tambe, and Y. Vorobeychik. Deceiving cyber adversaries: A game theoretic approach. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2:892–900, 2018.

[19] O. Setayeshfar, C. Adkins, M. Jones, K. H. Lee, and P. Doshi. GrAALF: Supporting graphical analysis of audit logs for forensics. arXiv preprint arXiv:1909.00902, 2019.

[20] M. Simaan and J. B. Cruz, Jr. On the Stackelberg strategy in nonzero-sum games. Journal of Optimization Theory and Applications, 11(5):533–555, 1973.

[21] L. Spitzner. The Honeynet Project: Trapping the hackers. IEEE Security & Privacy, 1(2):15–23, 2003.
Appendix
Attacker Policies
We model the attackers using optimal policies of their level-0 POMDPs. For our problem, we define three distinct types of attackers, which are modeled as separate frames in the I-POMDP. Below we discuss the optimal policy for each type.
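As a minimal illustration of how the defender reasons over these frames, the following sketch applies Bayes' rule to update a belief over attacker types after each noisily classified action. The frame names follow the appendix, but the observation likelihoods are made-up placeholders, not the paper's actual model parameters.

```python
# Sketch: Bayesian update of the defender's belief over attacker frames.
# Likelihood values below are illustrative, not the paper's parameters.

FRAMES = ["data_exfil", "data_manipulator", "persistent_threat"]

# P(observation | frame): how likely each frame's policy is to produce
# an observed (noisily classified) action.
OBS_LIKELIHOOD = {
    "FILE_RECON_SDATA": {"data_exfil": 0.6, "data_manipulator": 0.2, "persistent_threat": 0.2},
    "FILE_RECON_CDATA": {"data_exfil": 0.2, "data_manipulator": 0.6, "persistent_threat": 0.2},
    "VULN_RECON":       {"data_exfil": 0.1, "data_manipulator": 0.1, "persistent_threat": 0.8},
}

def update_belief(belief, observation):
    """One Bayes-rule step: b'(frame) is proportional to P(obs | frame) * b(frame)."""
    unnormalized = {f: OBS_LIKELIHOOD[observation][f] * belief[f] for f in FRAMES}
    z = sum(unnormalized.values())
    return {f: p / z for f, p in unnormalized.items()}

# Start from a uniform prior and fold in a sequence of observations.
belief = {f: 1.0 / len(FRAMES) for f in FRAMES}
for obs in ["VULN_RECON", "VULN_RECON"]:
    belief = update_belief(belief, obs)

print(belief)  # belief mass concentrates on persistent_threat
```

Repeated observations consistent with one frame's policy concentrate the defender's belief on that frame, which is the mechanism behind the increasingly accurate intent predictions described in the paper.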
The data exfil attacker frame
The data exfil type attacker is rewarded for stealing sensitive_data on the host. We model this type based on threats that steal private data and other sensitive data from systems. The attacker starts with no knowledge of the existence of data on the system. The optimal policy initially recommends the FILE_RECON_SDATA action, which simulates sensitive data discovery on computers. After failing to find data in the first few attempts, the attacker escalates privileges and searches again. If the attacker encounters unexpected types of decoys, the attacker leaves, since there is no reward for stealing data that is not sensitive. Also, observing discrepancies when data is found alerts the attacker to the possibility of deception, because the system only contains a single type of data. On being alerted to the possibility of being deceived, the attacker leaves the system.
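The behavior above can be sketched as a simple reactive rule, a hand-coded approximation of the level-0 POMDP policy rather than the policy itself. The FILE_RECON_SDATA and PRIV_ESC action names come from the paper; the observation names and the EXIT/EXFILTRATE actions are hypothetical labels for illustration.

```python
# Sketch: the data exfil attacker's behavior as a reactive rule.
# Observation names and EXIT/EXFILTRATE are hypothetical; the recon
# and escalation actions follow the appendix.

def data_exfil_policy(observation, failed_searches, escalated):
    """Return the attacker's next action given its last observation,
    the number of failed searches, and whether it has escalated privileges."""
    if observation in ("unexpected_decoy", "data_discrepancy"):
        return "EXIT"              # suspects deception; no reward remains
    if observation == "sensitive_data_found":
        return "EXFILTRATE"        # objective reached
    if failed_searches >= 2 and not escalated:
        return "PRIV_ESC"          # escalate privileges, then search again
    return "FILE_RECON_SDATA"      # keep searching for sensitive data

print(data_exfil_policy("nothing_found", failed_searches=0, escalated=False))   # FILE_RECON_SDATA
print(data_exfil_policy("nothing_found", failed_searches=2, escalated=False))   # PRIV_ESC
print(data_exfil_policy("data_discrepancy", failed_searches=1, escalated=True)) # EXIT
```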
The data manipulator attacker frame
The data manipulator type attacker is rewarded for manipulating critical_data on the host. This attacker type is modeled after attackers who intrude into systems to manipulate data that is critical to a business operation. Like the data exfil type, the attacker starts with no information about the data. The optimal policy for this attacker type recommends the FILE_RECON_CDATA action in the initial steps. Because critical data such as service configurations or databases is usually stored in well-known locations, the FILE_RECON_CDATA action is modeled to find critical_data more quickly than sensitive data is found. In the subsequent interaction steps, the attacker escalates privileges to continue the search if data is not found in the initial steps. Like the data exfil attacker, the data manipulator also leaves the system on observing discrepancies, suspecting deception, or on failure to find data.

Figure 5: Optimal policy for data exfil type attacker

Figure 6: Optimal policy for data manipulator type attacker
The persistent threat attacker frame
Figure 7: Optimal policy for persistent threat type attacker

The persistent threat type attacker aims to establish root-level persistence on the host. Such attacks are common: attackers establish a strong presence in an organization's network and stay dormant for an extended duration. For this attacker type, the policy consists of vulnerability discovery actions in the initial steps. The attacker escalates privileges by performing the PRIV_ESC action on finding vulnerabilities. Once the attacker has the required privileges, the PERSIST action is performed to complete the objective.

While all three attacker policies may seem significantly different based on their actions, the defender's observations of these actions are noisy. The errors in observation come from the noisy nature of real-time log analysis. For example, the VULN_RECON action models vulnerability discovery on a host. This action involves looking through the local file system for any vulnerable scripts, enumerating system information, listing services, etc. Thus a VULN_RECON can be mistaken for a FILE_RECON_CDATA or a FILE_RECON_SDATA in real-time log analysis. Similarly, it is difficult to tell the difference between the FILE_RECON_CDATA and FILE_RECON_SDATA actions.
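One way to picture this observation noise is as a confusion matrix over the recon actions, with the defender inverting it via Bayes' rule to recover a posterior over the attacker's true action. The probabilities below are illustrative placeholders, not measured values from the paper's log-analysis pipeline.

```python
# Sketch: observation noise in real-time log analysis as a confusion matrix.
# Rows are true attacker actions, columns the defender's observed label;
# the probabilities are illustrative, not measured values.

CONFUSION = {
    "VULN_RECON":       {"VULN_RECON": 0.5, "FILE_RECON_CDATA": 0.25, "FILE_RECON_SDATA": 0.25},
    "FILE_RECON_CDATA": {"VULN_RECON": 0.1, "FILE_RECON_CDATA": 0.5,  "FILE_RECON_SDATA": 0.4},
    "FILE_RECON_SDATA": {"VULN_RECON": 0.1, "FILE_RECON_CDATA": 0.4,  "FILE_RECON_SDATA": 0.5},
}

def posterior_over_actions(observed, prior):
    """P(true action | observed label) via Bayes' rule with a prior over actions."""
    unnorm = {a: CONFUSION[a][observed] * prior[a] for a in CONFUSION}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

prior = {a: 1.0 / 3 for a in CONFUSION}
post = posterior_over_actions("FILE_RECON_CDATA", prior)
print(post)  # the two file-recon actions remain hard to tell apart
```

With these numbers, observing FILE_RECON_CDATA leaves substantial posterior mass on FILE_RECON_SDATA as well, mirroring the ambiguity between the two file-recon actions described above.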