Spatial-Temporal Moving Target Defense: A Markov Stackelberg Game Model
Henger Li, Wen Shen, Zizhan Zheng
Department of Computer Science, Tulane University
{hli30,wshen9,zzheng3}@tulane.edu
ABSTRACT
Moving target defense has emerged as a critical paradigm of protecting a vulnerable system against persistent and stealthy attacks. To protect a system, a defender proactively changes the system configurations to limit the exposure of security vulnerabilities to potential attackers. In doing so, the defender creates asymmetric uncertainty and complexity for the attackers, making it much harder for them to compromise the system. In practice, the defender incurs a switching cost for each migration of the system configurations. The switching cost usually depends on both the current configuration and the following configuration. Besides, different system configurations typically require different amounts of time for an attacker to exploit and attack. Therefore, a defender must simultaneously decide both the optimal sequence of system configurations and the optimal timing for switching. In this paper, we propose a Markov Stackelberg Game framework to precisely characterize the defender's spatial and temporal decision-making in the face of advanced attackers. We introduce a relative value iteration algorithm that computes the defender's optimal moving target defense strategies. Empirical evaluation on real-world problems demonstrates the advantages of the Markov Stackelberg game model for spatial-temporal moving target defense.
KEYWORDS
Moving target defense; Stackelberg game; Markov decision process
INTRODUCTION

Moving target defense (MTD) has established itself as a powerful framework to counter persistent and stealthy threats that are frequently observed in Web applications [31, 33], cloud-based services [11, 23], database systems [40], and operating systems [8]. The core idea of MTD is that a defender proactively switches configurations (a.k.a. attack surfaces) of a vulnerable system to increase the uncertainty and complexity for potential attackers and to limit the resources (e.g., window of vulnerabilities) available to them [9]. In contrast to MTD techniques, traditional passive defenses typically use analysis tools to identify vulnerabilities and detect attacks.

A major factor for a defender to adopt MTD techniques rather than passive defensive applications is that adaptive and sophisticated attackers often have asymmetric advantages of resources (e.g., time, prior knowledge of the vulnerabilities) over the defender due to the static nature of system configurations [20, 30]. For instance, clever attackers are likely to exploit the system over time and subsequently identify the optimal targets to compromise without being detected by the static defensive applications [39]. In such cases, a well-crafted MTD technique is more effective because it changes the static nature of the system and increases the attackers' complexity and cost of mounting successful attacks [6]. In this way, it reduces or even eliminates the attackers' asymmetric advantages of resources.

Despite the promising prospects of MTD techniques, it is challenging to implement optimal MTD strategies. Two factors contribute to the difficulty. On one hand, the defender must make the switching strategies sufficiently unpredictable, because otherwise the attacker can thwart the defense.
On the other hand, the defender should not conduct system migrations too frequently, because each migration incurs a switching cost that depends on both the current configuration and the following configuration. Thus, the defender must make a careful tradeoff between the effectiveness and the cost efficiency of the MTD techniques. The tradeoff requires the defender to simultaneously decide both the optimal sequence of system configurations (i.e., the next configuration to be switched to) and the optimal timing for switching. To this end, an MTD model must precisely characterize the defender's spatial-temporal decision-making.

A natural approach to model the strategic interactions between the defender and the attacker is to use game-theoretic models such as the zero-sum dynamic game [37] and the Stackelberg Security Game (SSG) model [26, 28, 32]. The dynamic game model is general enough to capture different transition methods of the system states and various information structures, but optimal solutions to the game are often difficult to compute. This model also assumes that the switching cost from one configuration to another is fixed [37]. In the SSG model, the defender commits to a mixed strategy that is independent of the state transitions of the system [28, 32]. The attacker adopts a best-response strategy to the defender's mixed strategy. While optimal solutions to an SSG can be obtained efficiently, the SSG models neglect the fact that both the defender's and the attacker's strategies can be contingent on the states of the system configurations. To address this problem, Feng et al. incorporate both the Markov decision process and the Stackelberg game into the modeling of the MTD game [7].

Many MTD models [5, 10, 26, 28] do not explicitly introduce the concept of the defending period, although they usually assume that the defender chooses to execute a migration after a constant time period [27].
A primary reason is that when both time and the system's state influence the defender's decision making, optimal solutions to the defender's MTD problem are non-trivial [16, 27]. Recently, Li and Zheng have proposed to incorporate timing into the defender's decision-making process [15]. Their work assumes that there is a positive transition probability between any two configurations (so that the corresponding Markov decision process is unichain), which may lead to sub-optimal MTD strategies. Further, their model assumes that all the attackers have the same type, which might not be true in reality. To solve the defender's optimal MTD problem in more general settings, an MTD model that precisely captures the strategic interactions between the defender and the attacker is in urgent need.

Our contributions.
In this paper, we propose a general framework called the Markov Stackelberg Game (MSG) model for spatial-temporal moving target defense. The MSG model enables the defender to implement an optimal defense strategy that is contingent on both the source states and the destination states of the system. It also allows the defender to simultaneously decide which state the system should be migrated to and when it should be migrated. In the MSG model, we formulate the defender's optimization problem as an average-cost semi-Markov decision process (SMDP) [24] problem. We present a value iteration algorithm that can solve the average-cost SMDP problem efficiently after transforming the original average-cost SMDP problem into a discrete-time Markov decision process (DTMDP) problem. We empirically evaluate our algorithm using real-world data obtained from the National Vulnerability Database (NVD) [21]. Experimental results demonstrate the advantages of using MSG over the state-of-the-art approaches in MTD.
Our MSG model is built on an important game-theoretic model called Stackelberg games [34], which has broad applications in many fields, including spectrum allocation [36], smart grid [35], and security [29]. In a Stackelberg game, there are two players: a leader and a follower. The leader acts first by committing to a strategy. The follower observes the leader's strategy and then maximizes his reward based on his observation [1]. Each player in a Stackelberg game has a set of possible pure strategies (i.e., actions) that can be executed. A player can also play a mixed strategy, which is a distribution over the pure strategies.

Stackelberg games have been extensively used to model the strategic interactions between a defender and an attacker in security domains [29]. This category of Stackelberg games is often called Stackelberg security games (SSGs). In SSGs, a defender aims to defend a set of targets using a limited number of resources. An attacker learns the defender's strategy and mounts attacks to maximize his benefits. The pure strategy for a defender is an allocation of the defender's limited resources to the targets, while a mixed strategy is a probability distribution over all the possible allocations. The defender's optimization problem in an SSG is to compute a mixed strategy that maximizes her expected utility (i.e., minimizes her expected losses) given that the attacker learns the defender's mixed strategy and performs a best response to the defender's action. Whenever there is a tie, the attacker always breaks the tie in favor of the defender [19]. This solution concept is called the strong Stackelberg equilibrium [4]. In our paper, we consider the strong Stackelberg equilibrium as the solution concept for the MSG game.
In this section, we describe the Markov Stackelberg Game (MSG) model for moving target defense, where a defender and an attacker compete for control of the system. Specifically, we first introduce key notations used to model the system configuration, the defender, and the attacker. We then present both the attacker's and the defender's optimization problems in the spatial-temporal moving target defense game. Finally, we compare our MSG model with the Stackelberg Security Game model [22, 29] that has been extensively studied in security research.
In moving target defense, a defender proactively shifts variables of a computing system to increase the uncertainty for an attacker to mount successful attacks. The variables of the system are often called adaptive aspects [38, 39]. Typical adaptive aspects include IP address, port number, network protocol, operating system, programming language, machine replacement, and memory-to-cache mapping [38]. A computing system usually has multiple adaptive aspects. Let D ∈ N denote the number of adaptive aspects of the computing system and Φ_i the set of sub-configurations of the i-th adaptive aspect. If the defender selects the sub-configuration ϕ_i ∈ Φ_i for the i-th adaptive aspect, then the configuration state of the system is denoted as s = (ϕ_1, ϕ_2, ..., ϕ_D). Here, s is a generic element of the set of system configurations S = Φ_1 × Φ_2 × ⋯ × Φ_D. Let n = |S| denote the number of system configurations in S.

Example 3.1.
Consider a mobile app with two adaptive aspects: the programming language aspect Φ_1 = {Python, Java, JavaScript} and the operating system aspect Φ_2 = {iOS, Android}. If the defender selects Java for the programming language aspect and iOS for the operating system aspect, then the composite system configuration is s = (ϕ_1, ϕ_2) = (Java, iOS). The maximum number of valid configurations for the app is 6. However, it is likely that only a subset of the configuration set is valid due to both physical constraints and performance requirements of the system.

We consider a stealthy and persistent attacker that constantly exploits vulnerabilities of the system configurations to gain benefits.
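The Cartesian-product structure of the configuration space in Example 3.1 can be sketched in a few lines of Python; the validity constraint below is a hypothetical illustration, not one taken from the example:

```python
from itertools import product

# The two adaptive aspects from Example 3.1.
PHI_1 = ["Python", "Java", "JavaScript"]   # programming language aspect
PHI_2 = ["iOS", "Android"]                 # operating system aspect

# The configuration space S = Phi_1 x Phi_2 is the Cartesian product,
# so |S| = |Phi_1| * |Phi_2| = 6.
S = list(product(PHI_1, PHI_2))

# Only a subset of S may be valid in practice; model this with a filter.
# The constraint below is made up purely for illustration.
def is_valid(config):
    lang, os = config
    return not (lang == "JavaScript" and os == "iOS")

valid_S = [s for s in S if is_valid(s)]
```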
A successful attack requires a competent attacker that has the expertise to exploit the vulnerabilities. For instance, to gain control of a remote computer, the attacker needs the capability to obtain control of at least one of the following resources: the IP address of the computer, the port number of a specific application running on the machine, the operating system type, the vulnerabilities of the application, or the root privilege of the computer [38]. The attacker's ability profile is called the attacker type. There are different attacker types. Let L denote the set of attacker types. Each attacker type l ∈ L is capable of mounting a set of attacks A_l. The attacker type space A_l is a nonempty set of attack methods, each of which targets one vulnerability in one adaptation aspect and may affect multiple sub-configurations in that aspect. Multiple attacks may target the same vulnerability but may bring different benefits (losses) to the attacker (defender), as defined below. Whether an attack method belongs to an attacker type space or not depends on the application scenario. Figure 1 illustrates the relationship between attacker types, attack methods, adaptation aspects, sub-configuration parameters, and system configurations.
Figure 1: An illustration of the relationship between attacker types, attack methods, adaptation aspects, sub-configuration parameters, and system configurations.
If an attacker with attacker type l chooses an attack method a ∈ A_l to attack the system in state j, then the time needed for him to compromise the system is a random variable ξ_{a,j} that is drawn from a given distribution Ξ_{a,j}. If a does not target a vulnerability of any sub-configuration in state j, then ξ_{a,j} = +∞. The attacker only gains benefits when he has successfully compromised the system. A system is considered to be compromised as soon as the attacker compromises one of its sub-configurations. In this work, L, {A_l}, and {Ξ_{a,j}} are assumed to be common knowledge.

We consider a defender that proactively switches among different system configurations to increase the attacker's cost of mounting successful attacks. In practice, each migration incurs a cost to the defender. Frequent migrations often bring high cost, while infrequent migrations make the system vulnerable to potential attacks. To make MTD feasible for deployment, the defender must also find the optimal time period to defend. An optimal MTD solution for the defender is to compute a strategy that simultaneously determines where to migrate and when to migrate.
During the migration process, the defender updates the system to retake or maintain control of the system and pays some updating cost; she then selects and shifts the system from the current configuration state i ∈ S to the next valid system configuration state j ∈ S with a cost m_ij. We can consider the migration to be implemented instantaneously. This is without loss of generality because one can always assume that during the migration the system is still at state i. If the defender decides to stay at the current state i, then the cost is m_ii. Since there is a cost for updating the system, we require m_ii > 0 for all i ∈ S. Let M = [m_ij]_{n×n} be the matrix of migration costs between any states i, j ∈ S. It is crucial to note that the migration cost m_ij may depend on both the source configuration state i and the destination configuration state j.

The defender not only needs to determine which configuration state to use but also should decide when to move. If the defender stays in a configuration state sufficiently long, the attacker is likely to compromise the system even if the defender shifts eventually. Thus, the defender needs to pick the defending period judiciously. Let t_k denote the time instant at which the k-th migration happens; the k-th defending period is then t_k − t_{k−1}. Intuitively, a defending period τ should be greater than zero, since the system needs some time to prepare for the migration or the update. The defending period cannot be infinitely large either, because the system must switch configurations periodically; otherwise it loses the benefits of moving target defense, as the system will be compromised eventually and stay compromised after that. Therefore, it is natural to require that the defending period length τ has a lower bound τ_min ∈ R_{>0} and an upper bound τ_max ∈ R_{>0}, that is, τ ∈ [τ_min, τ_max]. Note that the unit time cannot be infinitely small in practice, which allows us to discretize time for analytic convenience.
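As a minimal sketch, the feasibility conditions described above on a stationary strategy u = (P, τ) — each row of P a probability distribution, each defending period within [τ_min, τ_max] — can be checked as follows (all numbers are illustrative):

```python
import numpy as np

def check_strategy(P, tau, tau_min, tau_max):
    """Check that u = (P, tau) is feasible: every row of the transition
    matrix P sums to one with nonnegative entries, and every defending
    period lies in [tau_min, tau_max]."""
    P = np.asarray(P, dtype=float)
    tau = np.asarray(tau, dtype=float)
    rows_ok = bool(np.allclose(P.sum(axis=1), 1.0) and (P >= 0.0).all())
    tau_ok = bool(((tau >= tau_min) & (tau <= tau_max)).all())
    return rows_ok and tau_ok

# Illustrative 2-state strategy (made-up numbers).
P = [[0.3, 0.7],
     [0.6, 0.4]]
tau = [2.0, 3.5]
assert check_strategy(P, tau, tau_min=1.0, tau_max=5.0)
```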
In our model, the defender adopts a stationary strategy u that simultaneously decides where to move (depending on both the current state in the k-th period and the next state in the (k+1)-th period) and how long to stay. Let p_ij denote the probability that the defender moves from state i to state j; the transition probabilities between any two states in S can then be represented by a transition matrix P = [p_ij]_{n×n}, where Σ_{j∈S} p_ij = 1. When the system configuration state is i, the defender uses a mixed strategy p_i = (p_ij)_{j∈S} to determine the next state to switch to.

In contrast, the defender needs to determine the next defending period τ_i ∈ [τ_min, τ_max] when she moves the system from the current state i to the next state. The defending period vector is thus τ = [τ_1, τ_2, ..., τ_n] with i ∈ S. In our model, we assume that the (k+1)-th defending period τ_i is determined by the state i in the k-th period. There are two considerations: first, the transition probabilities p_i have already captured the differences between different destination states; second, it allows the defender's problem to be modeled as a semi-Markov decision process (defined in Section 3.4.3). The defender's stationary strategy is denoted by u = (P, τ) = [u(i)]_{i∈S}, where u(i) = (p_i, τ_i) is the defender's action when the system's current configuration state is i. See Figure 2 for an illustration of the system model. Note that the subscript i in τ_i means that the period length depends on the state i in the k-th period; the period τ_i does not depend on the next state j in the (k+1)-th period for any j ∈ S.

In our paper, we model moving target defense as a Markov Stackelberg game (MSG) where the defender is the leader and the attacker is the follower. At the beginning, the defender commits to a stationary strategy that is announced to the attacker. The attacker's goal is to maximize his expected reward by constantly attacking the system's vulnerabilities. The defender's objective is to minimize the total loss due to attacks plus the total migration cost.

The MSG model is an extension of the Stackelberg Security Game (SSG) model that has been widely adopted in security domains [29]. A key advantage of using the Stackelberg game model is that it enables the defender, acting first, to commit to a mixed strategy while the attacker selects his best response after observing the defender's
Figure 2: An illustration of the Markov Stackelberg game model. The first defending period (between t_0 and t_1) depends on the initial state (assuming State 1 is the initial configuration state). A light green block represents a time period when the system is protected, while a dark red block denotes a time period when the system is compromised. Here p_ij is the probability that the defender moves from configuration state i to j, and τ_i is the length of the current defending period when the previous configuration is i.

strategy [14]. This advantage allows the defender to implement defense strategies prior to potential attacks [15]. In MTD, the defender proactively switches the system between different configurations to increase the attacker's uncertainty. It is thus natural to model the defender as the leader and the attacker as the follower.

In our model, we consider a stealthy and persistent attacker that learns the defender's stationary strategy u. Note that even if the defender's strategy is not announced initially, the attacker will learn the system states (e.g., through probing) and the defender's strategy eventually due to the stealthy and persistent nature of the attacks. Thus, it is without loss of generality to assume that the defender announces her stationary strategy to the attacker before the game starts. We further assume that the attacker always learns the previous system state of the (k−1)-th period at the beginning of the k-th period, no matter whether the attack was successful or not in the previous period. The defender has a prior π = {π_l} on the probability distribution of the attacker types l ∈ L. The attacker chooses an attack method at the beginning of each stage to maximize his long-term reward. In our model, the defender adopts a stationary strategy that is known to the attacker.
Thus, long-run optimality can be achieved if the attacker always follows a best response in each stage. Hence, we consider a myopic attacker that aims to maximize his benefits by always using a best response to the defender's strategy according to his knowledge of the previous system configuration state.

Consider an arbitrary stage and assume that the previous configuration state in the (k−1)-th period is i and the current configuration state in the k-th period is j. For an attack a to be successful in this stage, the time ξ_{a,j} required by the attacker when using attack method a targeting state j must be less than the length of the defending period τ_i. Let R^l_{a,j} denote the attacker's benefit per unit of time when the system is compromised, which is jointly determined by the attacker's type l, his chosen attack method a, and the state j. The reward that the attacker receives is then (τ_i − ξ_{a,j})^+ R^l_{a,j}, where (x)^+ ≜ max(0, x).

The optimization problem for the attacker with type l is to maximize his expected reward per defending period by choosing an attack method a from his attack space A_l:

    max_{a∈A_l} Σ_{j∈S} p_ij E[(τ_i − ξ_{a,j})^+] R^l_{a,j}    (1)

where τ_i and {p_ij} are the defending period length and the transition probabilities given by the defender under state i, respectively.

We consider a defender that constantly migrates among the system configurations in order to minimize the total loss due to attacks plus the total migration cost. We use an average-cost semi-Markov decision process (SMDP) to model the defender's optimization problem. The SMDP model considers a defender that aims to minimize her long-term defense cost over an infinite time horizon using spatial-temporal decision making.

Let C^l_{a,j} denote the unit-time loss for the defender under state j due to an attack a launched by a type-l attacker.
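The attacker's optimization problem (1) lends itself to a direct numerical sketch. The instance below is an illustrative assumption — two states, two hypothetical attack methods, attack times drawn from an exponential distribution — with E[(τ_i − ξ_{a,j})^+] estimated by Monte Carlo:

```python
import numpy as np

def expected_excess(tau, xi_samples):
    """Monte Carlo estimate of E[(tau - xi)^+]."""
    return np.maximum(tau - xi_samples, 0.0).mean()

def best_response(i, P, tau, attacks, R, sample_xi, rng):
    """Attacker's best response under previous state i, per Eq. (1):
    argmax over a of  sum_j p_ij * E[(tau_i - xi_{a,j})^+] * R[a][j].
    The attack methods, rewards, and sampler are illustrative."""
    best_a, best_val = None, -np.inf
    for a in attacks:
        val = sum(P[i][j] * expected_excess(tau[i], sample_xi(a, j, rng)) * R[a][j]
                  for j in range(len(P)))
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val

# Toy instance: two states, two hypothetical attack methods, Exp(1) attack times.
rng = np.random.default_rng(0)
P = [[0.5, 0.5], [0.5, 0.5]]
tau = [3.0, 3.0]
R = {"a1": [2.0, 2.0], "a2": [0.5, 0.5]}  # a1 dominates a2 in every state
sample_xi = lambda a, j, rng: rng.exponential(1.0, size=10_000)
a_star, _ = best_response(0, P, tau, ["a1", "a2"], R, sample_xi, rng)
```

Because a1's per-unit reward dominates a2's in every state of this toy instance, the best response is a1 regardless of sampling noise.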
The defender's action at state i is u(i) = (p_i, τ_i), and her expected single-period cost is:

    c(i, u(i)) = Σ_{l∈L} π_l ( Σ_{j∈S} p_ij E[(τ_i − ξ_{a_l,j})^+] C^l_{a_l,j} ) + Σ_{j∈S} p_ij m_ij
    s.t. a_l = argmax_{a∈A_l} ( Σ_{j∈S} p_ij E[(τ_i − ξ_{a,j})^+] R^l_{a,j} ), ∀ l ∈ L    (2)

where the first part in the objective function is the expected attacking loss and the second part is the expected migration cost.

The game starts at t_0 = 0. Let s_0 be the initial state, randomly selected from the state space S. The defender adopts a stationary policy u where u(s_k) = (p_{s_k}, τ_{s_k}) for each state s_k, which generates an expected cost c(s_k, u(s_k)) that includes potential losses from compromises and migrations. Given the initial state s_0, the defender's long-term average cost is defined as:

    z(s_0, u(s_0)) = liminf_{N→∞} [ Σ_{k=0}^{N−1} c(s_k, u(s_k)) ] / [ Σ_{k=0}^{N−1} τ_{s_k} ]    (3)

The goal of the defender is to commit to a stationary policy u* = [u*(i)]_{i∈S} that minimizes the time-average cost for any initial state:

    z(i, u*(i)) = inf_u z(i, u(i)), ∀ i ∈ S.

For each u(i) = (p_i, τ_i), we have p_ij ∈ [0, 1] for all j, Σ_j p_ij = 1, and τ_i ∈ [τ_min, τ_max]. Thus, the action space U(i) for every u(i) is [0, 1]^n × [τ_min, τ_max], which is a continuous space. We assume that c(i, u(i)) is continuous over U(i). The defender's optimization problem corresponds to finding a strong Stackelberg equilibrium [4], where the defender commits to an optimal strategy assuming that the attacker will choose the best response to the defender's strategy and break ties in favor of the defender. This is a common assumption in the Stackelberg security game literature.

There are two main challenges for the defender in computing the optimal strategies. First, when an arbitrary transition matrix P is allowed, the Markov chain associated with a given P may have a complicated chain structure. The optimal solution may not exist, and standard methods such as Value Iteration (VI) and Policy Iteration (PI) [24] may never converge when applied to an average-cost SMDP with a continuous action space. Second, a bilevel optimization problem needs to be solved in each iteration of VI or PI, which is challenging due to the infinite action space and the coupling of spatial and temporal decisions.

Our Markov Stackelberg game model extends the classic Stackelberg Security Game (SSG) [22] in important ways. In the classic SSG model, there is a set of targets and a defender has a limited amount of resources to protect them. The defender serves as the leader and commits to a mixed strategy. The attacker observes the defender's strategy (but not her action) and then responds accordingly. Thus, the SSG model is essentially a one-shot game. The SSG model has been extended to the Bayesian Stackelberg Game (BSG) model to capture multiple attacker types, where the defender knows the distribution of attacker types a priori, as we assume.
In a recent work [29], the BSG model has been used to model moving target defense where only the spatial decision is considered and the defender commits to a vector [p_j]_n, where p_j is the probability of moving to configuration state j in the next stage, independent of the current configuration state.

We note that the BSG model in [29] is a special case of our MSG model. Specifically, let τ_i = 1 for all i ∈ S and ξ_{a_l,j} = 0 for all a_l ∈ A_l, j ∈ S. We further assume that p_ij = p_j for all i, j ∈ S. Then the SMDP becomes an MDP with state-independent transition probabilities. Since each row of the transition matrix is the same, the stationary distribution of the corresponding Markov chain is just [p_j]_n. Therefore, the average-cost SMDP reduces to the following one-stage optimization problem for the defender:

    min_p Σ_l π_l ( Σ_j p_j C^l_{a_l,j} ) + Σ_{i,j} p_i p_j m_ij
    s.t. a_l = argmax_{a∈A_l} ( Σ_j p_j R^l_{a,j} ), ∀ l ∈ L    (4)

This is exactly the BSG model for MTD in [29]. In the BSG variant, the transition probabilities depend only on the destination state. It corresponds to a special case of our model where the transition probabilities in each row are the same. This simplified MTD strategy is optimal only if the migration cost depends on the destination state alone, not on the source state.

Our MSG model enables the defender to handle the complex scenarios where the migration cost is both source and destination dependent. It also takes the defending period into account when computing the optimal defense strategy. This consideration is useful because a stealthy and persistent attacker will compromise the system eventually if the system stays in a state longer than the corresponding attacking time.

This section presents an efficient solution to the defender's optimization problem in spatial-temporal moving target defense.
We first show that the original average-cost SMDP problem can be transformed into a discrete-time Markov decision process (DTMDP) problem using a data transformation method. We then introduce a value iteration algorithm to solve the DTMDP problem and prove that the algorithm converges to a nearly optimal MTD policy. The algorithm involves solving a bilevel optimization problem in each iteration, which can be formulated as a mixed integer quadratic program.
Before we present our solution, we first make two assumptions.

Assumption 1. The transition probability matrix P can be arbitrarily chosen by the defender.

Assumption 2. Given τ_min > 0 as the lower bound of the defending period length, the defender's cost per unit time c(i, u(i))/τ_i is continuous and bounded over U(i) for each i.

Both assumptions are reasonable and can be easily satisfied. Assumption 1 implies an important structural property of the SMDP, as formally defined below.
Definition 4.1 (Communicating MDP [24]).
For every pair of states i and j in S, there exists a deterministic stationary policy u under which j is accessible from i, that is, Pr(s_k = j | s_0 = i, u) > 0 for some k ≥ 1.

Solving the defender's optimization problem requires the algorithm to simultaneously determine the optimal transition probabilities and the optimal defending periods. The average-cost SMDP problem with a continuous action space is known to be difficult to solve [12]. Fortunately, one can apply the data transformation method introduced by Schweitzer [25] to transform the average-cost SMDP problem into a discrete-time average-cost Markov decision process (DTMDP) problem. The DTMDP has a simpler structure than the SMDP, with the same state space S and the same action sets U(i), where u(i) = (p_i, τ_i). The defender's per-stage cost c(i, u(i)) is converted to

    c̃(i, u(i)) = c(i, u(i)) / τ_i    (5)

Further, the transition probability from state i to state j for the DTMDP is

    p̃_ij(u(i)) = γ (p_ij − δ_ij) / τ_i + δ_ij    (6)

where δ_ij denotes the Kronecker delta (i.e., δ_ii = 1 and δ_ij = 0 for j ≠ i) and γ is a parameter that satisfies 0 < γ < τ_min ≤ τ_i / (1 − p_ii), where τ_min is the lower bound of the defending period length. Let P̃(u) = [p̃_ij(u(i))]_{n×n} denote the transition probability matrix of the DTMDP and c̃(u) = [c̃(i, u(i))]_n the defender's per-stage cost across all the states. If the system starts from the initial state s_0 ∈ S, then the long-term average cost becomes

    z̃(s_0, u(s_0)) = limsup_{N→∞} (1/N) Σ_{k=0}^{N−1} c̃(s_k, u(s_k))    (7)

The above data transformation has some nice properties, as summarized below.

Theorem 4.2 (Theorems 5.2 and 5.3 of [12]). Suppose an SMDP is transformed into a DTMDP using the above method. We have: (1)
If the SMDP is communicating, then the DTMDP is also communicating. (2)
If the SMDP is communicating, then a stationary optimal policy for the DTMDP is also optimal for the SMDP.
Theorem 4.2 indicates that the transformed DTMDP also has a constant optimal cost and, further, that to find a stationary optimal policy for the SMDP in our problem, it suffices to find a stationary optimal policy for the transformed DTMDP.
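A minimal sketch of the data transformation (5)-(6); the transition matrix, periods, and costs below are illustrative, and γ is chosen below the shortest defending period so that the transformed rows remain probability distributions:

```python
import numpy as np

def schweitzer_transform(P, tau, c, gamma):
    """Schweitzer's data transformation from an average-cost SMDP to a DTMDP:
        c_tilde[i]    = c[i] / tau[i]                                  (Eq. 5)
        P_tilde[i][j] = gamma * (P[i][j] - delta_ij) / tau[i] + delta_ij  (Eq. 6)
    with 0 < gamma < min(tau), which keeps P_tilde row-stochastic."""
    P = np.asarray(P, dtype=float)
    tau = np.asarray(tau, dtype=float)
    c = np.asarray(c, dtype=float)
    I = np.eye(len(tau))
    P_tilde = gamma * (P - I) / tau[:, None] + I
    c_tilde = c / tau
    return P_tilde, c_tilde

# Illustrative two-state example (made-up numbers, gamma = 1.0 < min(tau) = 2.0).
P_tilde, c_tilde = schweitzer_transform(
    P=[[0.2, 0.8], [0.5, 0.5]], tau=[2.0, 4.0], c=[1.0, 2.0], gamma=1.0)
# Each row of P_tilde is still a probability distribution.
```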
We adapt the
Value Iteration (VI) algorithm [2, 3] to solve the defender's problem and prove that the algorithm converges to a nearly optimal MTD policy. Before presenting the algorithm and the theoretical analysis, we introduce additional notations.
Let V be any vector in R^n. We define the mapping F: R^n → R^n as:

    F(V) = min_u [c̃(u) + P̃(u) V]    (8)

where the minimization is applied to each state i separately. For any vector x = (x_1, x_2, ..., x_n) ∈ R^n, let L(x) = min_{i=1,...,n} x_i and H(x) = max_{i=1,...,n} x_i. Let ∥·∥ denote the span seminorm defined as follows:

    ∥x∥ = H(x) − L(x)    (9)

It is easy to check that ∥·∥ satisfies the triangle inequality, that is, ∥x − y∥ ≤ ∥x∥ + ∥y∥ for any x, y ∈ R^n. Further, ∥x − y∥ = 0 if and only if there is a scalar λ such that x − y = λe, where e is a vector of all ones. Thus, there is a vector V such that ∥F(V) − V∥ = 0 (V is called a fixed point of F(·)) if and only if there is a scalar λ such that the following optimality equation is satisfied:

    λe + V = min_u [c̃(u) + P̃(u) V]    (10)

An important result in MDP theory [3, 24] is that the stationary policy u that attains the minimum in the optimality equation (10) is optimal, and λ gives the optimal long-term average cost.

Algorithm 1:
Value Iteration algorithm for the MTD game
Input: S, n, ϵ > 0, τ_min, τ_max, M, C, R, π, L, {A_l}, δ > 0.
Output: P*, τ*
1: Initialize V_0 ∈ R^n arbitrarily; t = 0
2: Set κ_0 = 0 and κ_t = t/(t + 1) for t = 1, 2, ...
3: repeat
4:   t = t + 1
5:   V_t = PImp(S, V_{t−1}, τ_min, τ_max, M, C, R, π, L, {A_l}, δ, κ_{t−1})
6:   V̄ = max_{i∈S} |V_t(i) − V_{t−1}(i)|
7:   V̲ = min_{i∈S} |V_t(i) − V_{t−1}(i)|
8: until V̄ − V̲ < ϵ
9: P*, τ* = arg PImp(S, V_{t−1}, τ_min, τ_max, M, C, R, π, L, {A_l}, δ)

Algorithm 2: Policy Improvement (PImp)

Input: S, V, τ_min, τ_max, M, C, R, π, L, {A_l}, δ, κ
Output: V_t
1: for each i ∈ S do
2:   v = +∞
3:   for τ = τ_min; τ ≤ τ_max; τ = τ + δ do
4:     V_t(i) = min_{p_i} [c̃(i, τ, M, C, R, π, L, {A_l}) + κ Σ_{j∈S} p̃_ij(p_i, τ) V(j)]
5:     if V_t(i) < v then v = V_t(i) end if
6:   end for
7:   V_t(i) = v
8: end for

The VI algorithm (see Algorithm 1) maintains a vector V_t ∈ R^n. The algorithm starts with an arbitrary V_0 (line 1) and a carefully chosen sequence {κ_t} (line 2) that ensures every limit point of {V_t} is a fixed point of F(V) (see Section 4.3.3). In each iteration, V_t is updated by solving a policy improvement step V_t = PImp(S, V_{t−1}, τ_min, τ_max, M, C, R, π, L, {A_l}, δ, κ_{t−1}) (line 5). In each policy improvement step (Algorithm 2), instead of finding the optimal p_i and τ_i together for each state i, which is a challenging problem, we discretize [τ_min, τ_max] and search for τ_i with a step size δ (line 3 of Algorithm 2). This approximation is reasonable since in practice the unit time cannot be infinitely small. Note that the smaller δ is, the closer τ* (line 9 of Algorithm 1) is to the optimal one. Also note that the minimization in line 4 of Algorithm 2 is actually a bilevel problem, which will be discussed in detail in Section 4.4.

Under Assumptions 1 and 2, Algorithm 1 stops in a finite number of iterations (lines 6-8) and is able to find a near-optimal policy P*, τ* = arg PImp(S, V_{t−1}, τ_min, τ_max, M, C, R, π, L, {A_l}, δ) (formally proved in Section 4.3.3). In practice, a near-optimal solution is sufficient because it can be expensive or even unrealistic to obtain the exact minimum average cost in a large-scale MDP. The algorithm itself, however, still attains the optimal solution as the number of iterations goes to infinity (and δ approaches 0).
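The skeleton of the iteration can be sketched as follows. This is a deliberately simplified illustration, not the paper's Algorithm 1: the bilevel policy-improvement step is replaced by brute-force enumeration over a finite, pre-discretized action set, the κ_t damping sequence is omitted, and the two-state instance at the bottom is made up:

```python
import numpy as np

def span(x):
    """Span seminorm: max(x) - min(x)."""
    return np.max(x) - np.min(x)

def value_iteration(c_tilde_of, P_tilde_of, actions, n, eps=1e-6, max_iter=10_000):
    """Undiscounted value iteration on a transformed DTMDP.
    c_tilde_of(i, u) -> per-stage cost; P_tilde_of(i, u) -> transition row.
    Stops when the span of V_{t+1} - V_t falls below eps; the max difference
    then approximates the optimal average cost lambda."""
    V = np.zeros(n)
    for _ in range(max_iter):
        V_new = np.array([
            min(c_tilde_of(i, u) + P_tilde_of(i, u) @ V for u in actions[i])
            for i in range(n)
        ])
        diff = V_new - V
        if span(diff) < eps:
            return V_new, float(np.max(diff))
        V = V_new
    return V, None

# Made-up two-state instance with a single action per state.
actions = {0: ["stay"], 1: ["stay"]}
row = np.array([0.5, 0.5])
V, lam = value_iteration(lambda i, u: [1.0, 3.0][i], lambda i, u: row, actions, 2)
# lam approximates the average cost 0.5 * 1.0 + 0.5 * 3.0 = 2.0
```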
Our VI algorithm is adapted from the work of Bertsekas [3] and Bather [2], which originally addresses the average-cost MDP problem with a finite state space and an infinite action space. However, their proofs do not directly apply to our algorithm because they either consider the transition probabilities as the only decision variables [2] or involve the use of randomized controls [3]. In contrast, our strategy includes both the probability transition matrix and the (deterministic) defending periods.

For a given stationary policy u with transition matrix P̃, let P̃* denote the Cesaro limit given by

P̃* = lim_{N→∞} (I + P̃ + P̃^2 + · · · + P̃^{N-1}) / N.

Then the average cost associated with u can be represented as P̃* c̃(u) [24]. The policy u is called ϵ-optimal if P̃* c̃(u) ≤ λ + ϵe, where λ is the optimal cost vector. In practice, it is often expensive or even unrealistic to compute an exact optimal policy, and an ϵ-optimal policy might be good enough. Our main results can be summarized as follows.

Theorem 4.3. Under Assumptions 1 and 2, we have: (1) the DTMDP problem (and thus the SMDP problem) has an optimal stationary policy; (2) the sequence of policies generated by Algorithm 1 eventually leads to an ϵ-optimal policy.

Proof Sketch: The first part can be proved using techniques similar to those in the proofs of Theorem 2.4 of [2] and Proposition 5.2 of [3]. The main idea is to show that (1) {∥V_t∥} is bounded, so the vector sequence {V_t} must have a limit point, and (2) every limit point of {V_t} is a fixed point of F(·), thus leading to an optimal solution. The second part follows from Theorem 6.1 and Corollary 6.2 of [2]. The main idea is to show that (1) if ∥F(V_t) − V_t∥ ≤ ϵ, then the corresponding policy is ϵ-optimal, and (2) lim_{t→∞} ∥F(V_t) − V_t∥ = 0 (again using the boundedness of {∥V_t∥}), so that ∥F(V_t) − V_t∥ ≤ ϵ holds eventually. Note that this condition is exactly the stopping condition V_max − V_min < ϵ in Algorithm 1 (line 8).

The proof of Theorem 4.3 relies on the key property that the sequence {∥V_t∥} generated by Algorithm 1 is bounded. Due to the coupling of spatial and temporal decisions in our problem, the techniques in [2, 3] cannot be directly applied to prove this fact. Below we prove this result by adapting the techniques in [2, 3].

Lemma 4.4. Let Assumptions 1 and 2 hold, and let {κ_t} be a nondecreasing sequence with κ_t ∈ [0, 1] for each t. Consider a sequence {V_t} where V_{t+1} = F(κ_t V_t) = min_{P,τ} [c̃(P, τ) + κ_t P̃(P, τ) V_t]. Then {∥V_t∥} is bounded.

Proof. Without loss of generality, we assume V_0 = e. Since c̃(P, τ) is bounded according to Assumption 2, there exists a constant β such that 0e ≤ c̃(P, τ) ≤ βe. Then, using the fact that κ_t is nondecreasing, we can show by induction that {V_t} is nondecreasing. For a communicating system, for each pair of states i and j, there exists a stationary policy u_ij such that j is accessible from i.
Now we combine the transition probability matrices from these policies to form a new matrix

Q = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} P̃(u_ij).   (11)

The key observation is that Q is also a valid transition probability matrix; that is, there exists a policy u such that Q = P̃(u). This is because the defender can choose arbitrary migration probabilities and defending period lengths as her strategy. Formally, for an arbitrary transition probability matrix Q = [q_ij] given by (11), we may solve the data-transformation equations to identify the corresponding policy u = (P, τ):

q_ij = γ p_ij / τ_i,  ∀j ≠ i,   (12)
q_ii = γ (p_ii − 1) / τ_i + 1.   (13)

For Q = [q_ij] given by (11), we must have q_ij ∈ [0, γ/τ_min] for j ≠ i and q_ii ∈ [1 − γ/τ_min, 1] by the definition of the data transformation. Define a policy u = (P, τ) where τ = τ_min e and P = [p_ij] is given by

p_ij = τ_min q_ij / γ,  ∀j ≠ i,
p_ii = 1 − (τ_min/γ)(1 − q_ii).

Using the above properties of q_ij and the definition of γ, it is easy to verify that P is a probability transition matrix and that u = (P, τ) defined above satisfies equations (12) and (13).

From the definition of Q, every state is accessible from every other state. Defining T_ij as the expected number of transitions needed to reach j from i, we have

T_ij = 1 + Σ_{k≠j} q_ik T_kj,  i, j ∈ S, i ≠ j.

We can then show by induction that

V_t(i) ≤ β T_ij + V_t(j),  i, j ∈ S, i ≠ j, t = 0, 1, . . . .

It follows that ∥V_t∥ ≤ max{β T_ij | i, j ∈ S, i ≠ j}. Thus, {∥V_t∥} is bounded. This concludes the proof. □
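The data-transformation step in this argument is constructive and easy to check mechanically: given any Q = [q_ij] satisfying the stated bounds, equations (12) and (13) can be inverted with a uniform defending period. A minimal sketch (the values of gamma, tau and Q are hypothetical, chosen to satisfy q_ij ≤ γ/τ and q_ii ≥ 1 − γ/τ):

```python
def policy_from_Q(Q, gamma, tau):
    """Invert the data-transformation equations (12)-(13): given a target
    transition matrix Q, recover migration probabilities P with a uniform
    defending period tau, so that q_ij = gamma*p_ij/tau for j != i and
    q_ii = gamma*(p_ii - 1)/tau + 1."""
    n = len(Q)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                P[i][i] = 1.0 - (tau / gamma) * (1.0 - Q[i][i])
            else:
                P[i][j] = tau * Q[i][j] / gamma
    return P

def transform(P, gamma, tau):
    """Forward data transformation: policy (P, tau) -> DTMDP matrix Q."""
    n = len(P)
    return [[gamma * P[i][j] / tau if i != j
             else gamma * (P[i][i] - 1.0) / tau + 1.0
             for j in range(n)] for i in range(n)]
```

Round-tripping through `transform` reproduces Q, and the recovered P is row-stochastic, which is exactly the verification step the proof appeals to.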
To compute the optimal defense strategy with Algorithm 1, we need to solve the following optimization problem for a given scalar τ and a vector V_{t-1} (lines 5 and 9 in Algorithm 1):

V_t(i) = min_{p_i} [c̃(i, p_i, τ) + κ_{t-1} Σ_{j∈S} p̃_ij(p_i, τ) V_{t-1}(j)].

Substituting c̃(i, p_i, τ) and p̃_ij(p_i, τ) by their definitions in Equations (5) and (6), and denoting w_{lj,a} ≜ E[(τ_i − ξ_{la,j})^+] and θ_j ≜ m_ij + γ κ_{t-1} V_{t-1}(j) to simplify the notation, the defender's optimization problem simplifies to (with the constant terms in the objective function omitted):

min_{p_i}  Σ_l π_l ( Σ_j p_ij w_{lj,a_l} C_{la_l,j} ) + Σ_j p_ij θ_j
s.t.  p_ij ∈ [0, 1], ∀j ∈ S;  Σ_{j∈S} p_ij = 1;
      a_l = arg max_{a∈A_l} ( Σ_j p_ij w_{lj,a} R_{la,j} ),  ∀l ∈ L.

Using a technique similar to that for solving Bayesian Stackelberg games [22], this bilevel optimization problem can be modeled as a Mixed Integer Quadratic Program (MIQP):

min_{p_i, n, v}  Σ_{j∈S} Σ_{l∈L} Σ_{a∈A_l} π_l w_{lj,a} C_{la,j} p_ij n_la + Σ_j p_ij θ_j
s.t.  Σ_{j∈S} p_ij = 1;
      Σ_{a∈A_l} n_la = 1,  ∀l ∈ L;
      0 ≤ v_l − Σ_j p_ij w_{lj,a} R_{la,j} ≤ (1 − n_la) B,  ∀a ∈ A_l, l ∈ L;   (14)
      p_ij ∈ [0, 1], n_la ∈ {0, 1}, v_l ∈ R,  ∀j ∈ S, a ∈ A_l, l ∈ L,

where the binary variable n_la = 1 if and only if a ∈ A_l is the best action for the type-l attacker. This is ensured by constraint (14), where v_l is an upper bound on the attacker's reward and B is a large positive number.

Algorithm 3 Relative Value Iteration algorithm for the MTD game

Input: S, n, ϵ > 0, τ_min, τ_max, M, C, R, π, L, {A_l}, δ > 0.
Output: P*, τ*
1: V_0 ∈ R^n
2: W_0 = V_0 − V_0(s) e
3: κ_0 = 0, κ_t = t/(t + 1) for t = 1, 2, . . .
4: repeat
5:    t = t + 1
6:    V_t = PImp(S, W_{t-1}, τ_min, τ_max, M, C, R, π, L, {A_l}, δ, κ_{t-1})
7:    W_t = V_t − V_t(s) e
8:    V_max = max_{i∈S} |V_t(i) − V_{t-1}(i)|
9:    V_min = min_{i∈S} |V_t(i) − V_{t-1}(i)|
10: until V_max − V_min < ϵ
11: P*, τ* = arg PImp(S, W_{t-1}, τ_min, τ_max, M, C, R, π, L, {A_l}, δ, κ_{t-1})

The VI algorithm can be slow in practice due to the large number of iterations needed to converge and the complexity of solving multiple MIQP problems in each iteration. To obtain a more efficient solution, we introduce a Relative Value Iteration (RVI) algorithm (see Algorithm 3). The RVI algorithm maintains a state-value vector V_t, as in the VI algorithm, as well as a relative state-value vector W_t defined as W_t = V_t − V_t(s) e, where s is a fixed state (line 2).
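For intuition about the bilevel structure (not the paper's MIQP, which needs a Gurobi-class solver), the same leader-follower logic can be brute-forced for a single attacker type by discretizing the leader's simplex and letting the follower best-respond; the payoff arrays below are hypothetical, and ties are broken by first index rather than in the leader's favor.

```python
from itertools import product

def best_leader_strategy(C, R, theta, w, grid=100):
    """Brute-force sketch of the bilevel problem for one follower type:
    enumerate leader mixed strategies p on a discretized simplex; the
    follower best-responds with a = argmax_a sum_j p_j*w[j][a]*R[a][j];
    the leader minimizes sum_j p_j*(w[j][a]*C[a][j] + theta[j]).
    C[a][j]: leader loss, R[a][j]: follower reward, w[j][a]: weight."""
    n_act, n_st = len(C), len(theta)
    best_p, best_cost = None, float("inf")
    # integer compositions of `grid` into n_st parts -> simplex grid points
    for comp in product(range(grid + 1), repeat=n_st - 1):
        if sum(comp) > grid:
            continue
        p = [c / grid for c in comp] + [(grid - sum(comp)) / grid]
        a = max(range(n_act),
                key=lambda act: sum(p[j] * w[j][act] * R[act][j]
                                    for j in range(n_st)))
        cost = sum(p[j] * (w[j][a] * C[a][j] + theta[j]) for j in range(n_st))
        if cost < best_cost:
            best_p, best_cost = p, cost
    return best_p, best_cost
```

The grid search visits on the order of grid^(n−1) points, so this is purely illustrative; the MIQP formulation above is the practical route.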
The RVI algorithm again starts with an arbitrary vector V_0. In each iteration t, V_t and W_t are updated as V_t = F(W_{t-1}) (line 6), where F is defined in Equation (8), and W_t = V_t − V_t(s) e (line 7). Under Assumptions 1 and 2, we can show that the sequence {W_t} converges to a vector W* satisfying the optimality equation F(W*)(s) e + W* = F(W*), similar to Equation (10), where F(W*)(s) gives the optimal average cost. Further, Theorem 4.3 still holds after substituting ∥W_{t+1} − W_t∥ for ∥V_{t+1} − V_t∥.

Theoretically, the RVI algorithm does not improve the efficiency of the VI algorithm, since W_t and V_t differ only by a multiple of the vector of all ones, and the bilevel optimization problems involved in the two algorithms are mathematically equivalent. In terms of convergence rates, ∥W_{t+1} − W_t∥ has the same bound as ∥V_{t+1} − V_t∥, as determined by the sequence {κ_t} (line 2 in Algorithm 1 and line 3 in Algorithm 3) and the upper bound of ∥V_t∥ = ∥W_t∥. In practice, however, the RVI algorithm often requires a smaller number of iterations for the same ϵ. In addition, if we introduce the assumption that p_is ≥ ρ > 0 for all i ∈ S (where ρ can be arbitrarily small) to further restrict the Markov chain structure, and set κ_t = 1 for all t, the rate of convergence can be significantly improved [3] for both the VI and the RVI algorithms.

We conducted numerical simulations using real data from the National Vulnerability Database (NVD) [21] to demonstrate the advantages of using MSG for spatial-temporal MTD. In particular, we derived the key attack/defense parameters from the Common Vulnerabilities and Exposures (CVE) scores [17] in NVD, which have been widely used to describe the weakness of a system with respect to certain risk levels. We used data samples in NVD with CVE scores ranging from January 2013 to August 2016.
As in [28], the Base Scores (BS) and Impact Scores (IS) were used to represent the attacker's reward (per unit time) and the defender's cost (per unit time), respectively. Further, we used the Exploitability Scores (ES) to estimate the distribution of attack time.

We conducted two groups of experiments: one with spatial decisions only, and one with joint spatial-temporal decisions. We compared the MSG method with two benchmarks: the Bayesian Stackelberg Game (BSG) model [28] and the Uniform Random Strategy (URS) [28]. We used the Gurobi solver (academic version 8.1.1) for the MIQP problems in BSG and MSG. All the experiments were run on the same 24-core 3.0GHz Linux machine with 128GB RAM. Code is available at https://github.com/HengerLi/SPT-MTD.

In the spatial decision setting, the defender periodically moves in unit time length and the attacker instantaneously compromises the system when he chooses the correct configuration. We compared the MSG model with the original BSG and URS models in [28]. In BSG, the defender determines the next configuration according to a fixed transition probability vector [p_j]_n that is independent of the current configuration. In URS, the defender selects the next configuration uniformly at random.

For fair comparisons, we followed the same data generation method as used in the work by Sengupta et al. [28]. The system has four configurations S = {(PHP, MySQL), (Python, MySQL), (PHP, postgreSQL), (Python, postgreSQL)}, with the switching cost shown in Figure 3.

Figure 3: The migration cost in the MTD system. Each row represents a source configuration and each column represents a destination configuration. The updating costs are shown in the blue boxes.

The default migration cost matrix M is the one shown in Figure 3, obtained by adding an updating cost of 2 to all the switching costs in [28]. We varied the updating cost (see the numbers in the parentheses of the blue boxes in Figure 3) and evaluated the impact of different updating costs on the performance of the defender's strategies.

In this experiment, we considered three attacker types: the Script Kiddie that could attack Python and PHP, the Database Hacker that is able to attack MySQL and postgreSQL, and the Mainstream Hacker that could attack all the techniques. The defender possesses a prior belief over the three attacker types. The sizes of their attack spaces are 34, 269, and 48, respectively. For BSG, the defender's optimization problem was directly solved with MIQP as in [28]. For MSG, the bilevel optimization problem was solved with MIQP for each configuration in every iteration of Algorithm 1 (with the convergence parameter ϵ = 0.1). We introduced a parameter α to adjust the ratio between the attacking cost and the migration cost; that is, instead of the m_ij shown in Figure 3, we used α m_ij as the migration cost from state i to state j. As α increases, the migration cost has a larger impact on the defender's decisions. We varied the value of α from 0 to 2.6. In this setting, τ_i = 1 for all i ∈ S, and ξ_{la_l,j} = 0 for all a_l ∈ A_l, j ∈ S.

Figure 4 shows that the defender's cost increases for all three policies as the migration cost grows. However, the magnitude of the increase differs across the three policies. In URS, the cost increases linearly due to the uniform random strategy (0.25, 0.25, 0.25, 0.25). In both MSG and BSG, the defender's cost grows sub-linearly. However, the defender incurs substantially less cost in MSG. The reason is that although both MSG and BSG enable the defender to choose the respective optimal strategies, MSG allows the defender to vary her strategy according to different source configurations, while in BSG the defender must choose the same strategy for all source configurations.

Figure 4: A comparison of the defender's cost in the three policies with spatial decisions only - MSG (ϵ = 0.1), BSG and URS with unit defending period (τ_i = 1 for all i ∈ S) and zero attacking time, as the parameter α increases.

Figure 5: The defender's cost in the three policies with spatial decisions only - MSG (ϵ = 0.1), BSG and URS under four different updating costs at a fixed α. The four settings change the default updating cost 2 (in the blue boxes of Figure 3) while keeping the other values unchanged.

When α is small, BSG uses ( , , . , . ) as the defender's strategy, and MSG chooses almost the same strategy for each configuration (MSG could do better if we imposed temporal decisions, as shown below). When α ∈ [ . , . ], BSG uses the strategies ( . , . , . , . ) and ( . , . , , . ). In contrast, as α grows, MSG chooses the most beneficial configuration (PHP, postgreSQL) (in terms of the attacking cost plus the updating cost) as the absorbing state. At the turning point α = .2, the defender changes the spatial strategy at (Python, MySQL) from ( , , . , . ) to ( , , , ) and the spatial strategy at (PHP, MySQL) from ( , , . , . ) to ( . , , , . ) to trade uncertainty for less migration cost. These source-dependent adjustments cannot be achieved by BSG. Moreover, a Markov chain structure with both absorbing states (once the system moves in, it always stays there) and transient states (once the system moves out, it never comes back) cannot be achieved by BSG, since the only way to stay in a configuration using BSG is to assign a probability of 1 to that configuration for all source configurations, which, however, removes any uncertainty for the attacker. This indicates that MSG can achieve a better trade-off between the migration cost and the loss from attacks, where the latter is determined by the uncertainty of the attacker.

When α is large, the overall migration cost is high. In BSG, the defender uses the strategy ( . , . , , ) (for α ∈ [ . , . ]) to move between the two configurations with relatively low migration cost (which is still higher than the updating cost). For α ∈ [ . , . ], MSG chooses the strategy (0, 1, 0, 0) instead of (0.5, 0, 0, 0.5) at (PHP, MySQL): to avoid the costly move from (PHP, MySQL) to (Python, postgreSQL), which has a migration cost of 12, the defender takes another path, first moving from (PHP, MySQL) to (Python, MySQL) and then to (Python, postgreSQL). This again brings some uncertainty to the attacker.

We further compared the defender's cost under four different updating cost settings (see Figure 5). In URS, the defender's cost linearly increases as the updating cost grows due to the uniform spatial strategy. In BSG, the defender's cost stays the same because of the inaccurate estimation of p_i p_j using the piecewise linear McCormick envelopes [28].
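The absorbing/transient structure discussed above can be read off a policy's transition matrix mechanically. A small sketch (the 4-state matrix is hypothetical, loosely mimicking a policy that funnels into two recurrent configurations):

```python
def reachable(P, i):
    """States reachable from i (including i) under transition matrix P."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def classify(P):
    """Split states into recurrent and transient: state i is recurrent iff
    every state reachable from i can reach i back."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    recurrent = {i for i in range(n) if all(i in reach[j] for j in reach[i])}
    return recurrent, set(range(n)) - recurrent
```

State i is recurrent exactly when every state reachable from i can reach i back; everything else is transient.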
In contrast, MSG generates the accurate probabilities p_ij corresponding to the configuration pair in each migration. We observe that as the updating cost increases, the gap in the defender's cost between MSG and the other two models decreases. The reason is that when the difference between the moving cost and the updating cost shrinks, the advantage of setting the most beneficial configuration as an absorbing state (i.e., avoiding unnecessary switching) becomes negligible. MSG has significant advantages over the other two methods when the migration cost matrix is unbalanced (e.g., the updating cost is significantly smaller than the moving cost) and the cost values vary across different source configurations.

In the joint spatial-temporal decision setting, the defender needs to decide not only the next configuration to move to but also the length of each defending period τ. In our experiments, τ is in the range [0.1, 2.6] with an increment parameter δ = 0.1. For MSG, the optimal τ_i for each configuration i was obtained together with the spatial decisions using Algorithm 1. We extended the BSG and URS policies by incorporating the attacking times and the defending periods into the objectives of both the defender and the attacker (as we did in our MSG model), where a fixed defending period is used for all the configurations, since both policies ignore the source configuration in each migration. To have a fair comparison with MSG, we searched for the optimal defending period for BSG and URS, respectively, by solving the following optimization problems:

(1) Uniform Random Strategy (URS) with temporal decisions:

min_τ  Σ_l π_l ( Σ_j (1/n) E[(τ − ξ_{a_l,j})^+] C_{la_l,j} ) + (α / (n² τ)) Σ_{i,j} m_ij
s.t.  a_l = arg max_{a∈A_l} ( Σ_j (1/n) E[(τ − ξ_{a,j})^+] R_{la,j} ),  ∀l ∈ L.

(2) Bayesian Stackelberg Game (BSG) with spatial-temporal decisions:

min_{p,τ}  Σ_l π_l ( Σ_j p_j E[(τ − ξ_{a_l,j})^+] C_{la_l,j} ) + (α/τ) Σ_{i,j} p_i p_j m_ij
s.t.  a_l = arg max_{a∈A_l} ( Σ_j p_j E[(τ − ξ_{a,j})^+] R_{la,j} ),  ∀l ∈ L.

Figure 6: A comparison of the defender's cost in the three spatial-temporal policies - MSG (ϵ = 0.1), BSG and URS as the parameter α increases.

We assigned a random attacking time to each attack that aims to compromise the system. The random attacking time ξ_{a,j} was drawn from the exponential distribution Exp(ES_a) (so the mean attacking time is 1/ES_a) when a targets a vulnerability in state j, and ξ_{a,j} = +∞ otherwise. Here, ES_a refers to the exploitability score of the vulnerability targeted by attack method a; the ES score of a vulnerability is a value between 0 and 10, and a higher ES score means the vulnerability is easier to exploit [18].

We used the same migration cost setting as in the spatial decision experiment, with the default migration cost matrix shown in Figure 3. For MSG, we set the convergence parameter ϵ to 0.1. For each vulnerability, we generated 1000 samples from the corresponding exponential distribution and used their average as the attacking time.
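The expected exposure term E[(τ − ξ_{a,j})^+] used above also has a closed form when ξ ~ Exp(λ), namely τ − (1 − e^{−λτ})/λ, which is handy for checking sample-average estimates like the one just described; the rate value in the sketch is illustrative (in the experiments λ = ES_a, so the mean attacking time is 1/ES_a):

```python
import math
import random

def expected_exposure(tau, rate):
    """Closed form of E[(tau - xi)^+] for xi ~ Exp(rate): the expected time
    the attacker controls the system within a defending period of length tau.
    E[(tau - xi)^+] = tau - (1 - exp(-rate*tau)) / rate."""
    return tau - (1.0 - math.exp(-rate * tau)) / rate

def monte_carlo_exposure(tau, rate, n=200000, seed=0):
    """Sample-average estimate of the same quantity, mirroring the sampled
    attacking times used in the experiments (more samples here to tighten
    the comparison)."""
    rng = random.Random(seed)
    return sum(max(tau - rng.expovariate(rate), 0.0) for _ in range(n)) / n
```

The two estimates agree to within Monte Carlo error, which is a quick sanity check on any sampled attack-time model.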
When the defender is able to decide when to migrate, all three models produce lower cost than the respective models with a fixed unit defending period (see Figure 6 and Figure 4). When α is small, the attacking cost has a major impact on the defender's cost. Shorter defending periods lead to more frequent switches, which efficiently increase the uncertainty of the attacker (Figure 7) and thus reduce the attacking cost in the end. As α grows, the migration cost becomes the major factor, and the defender's spatial decision plays the major role in the cost, since all models then choose the same τ = 2.6.

Figure 7: A comparison of the defender's temporal decisions in the three policies, with τ ∈ [0.1, 2.6] searched with an increment δ = 0.1. For all three policies, the optimal defending period of every configuration reaches 2.6 when α > 0.3. MSG-1, MSG-2, MSG-3 and MSG-4 represent the defender's temporal decisions in MSG for the configurations (PHP, MySQL), (Python, MySQL), (PHP, postgreSQL) and (Python, postgreSQL), respectively.

Figure 8: A comparison of the defender's cost between MSG and BSG with different temporal decisions (τ = 0.1, 1, 2.6, and the optimal τ) as α increases. The defender's optimal spatial decision in BSG remains unchanged in these settings.

MSG can also adjust the amount of movement by making some states absorbing (zero movement). We observed that for the spatial decisions, the defender in the MSG model tends to move to the configurations with lower attacking cost, namely (PHP, postgreSQL) and (Python, postgreSQL), and never to move out of them. That is, these two configurations are recurrent states and the other two configurations are transient states of the corresponding Markov chain. For intermediate values of α, the defender tends to move between the two recurrent states to increase the attacker's uncertainty, while for larger α, (Python, postgreSQL) becomes an absorbing state to reduce the migration cost. In contrast, the defender under BSG chooses the strategy ( , , . , . ) when α is small and adopts the uniform strategy (0.25, 0.25, 0.25, 0.25) when α becomes large. The latter is because configurations with higher average incoming migration cost have lower attacking cost. Besides, the total cost (the attacking cost plus the average incoming migration cost) becomes more similar across different configurations as α becomes large. Essentially, MSG has advantages over BSG in all the scenarios because it allows a more refined trade-off between the migration cost and the attacking cost, where the latter is determined by the attacker's uncertainty.

For the temporal decisions, MSG, BSG and URS always choose the maximum τ (i.e., 2.6) when α > 0.3. This temporal strategy is reasonable when the migration cost weighs more than the attacking cost; in this scenario, increasing τ decreases the frequency of migration. From Figure 8, we can see that the optimal temporal decision τ increases as the migration cost weighs more (i.e., as α grows). Thus, another reason that MSG outperforms BSG for small α is that MSG enables the defender to choose a longer period length on the configurations that have lower attacking cost, such as (PHP, MySQL) (see Figure 7).

In this paper we consider a defender's optimal moving target defense problem in which both the sequence of system configurations and the timing of switching are important. We introduce a Markov Stackelberg Game framework to model the defender's spatial and temporal decision making that aims to minimize the losses caused by compromises of the system and the cost required for migration. We formulate the defender's optimization problem as an average-cost SMDP and transform the SMDP problem into a DTMDP problem that can be solved efficiently. We propose an efficient algorithm that computes the optimal defense policies through relative value iteration. Experimental results on real-world data demonstrate that our algorithm outperforms state-of-the-art benchmarks for MTD. Our Markov Stackelberg Game model precisely captures a defender's spatial-temporal decision making in the face of adaptive and sophisticated attackers.

Our work opens up new avenues for future research. In this paper, we have considered the scenario in which the defender has prior information about the distribution of attacker types; it would be interesting to study the case in which this distribution is unknown to the defender. We have also assumed that the attacker is myopic; one may consider the MTD problem when the attacker has bounded rationality [13].
ACKNOWLEDGMENTS

This work has been funded in part by NSF grant CNS-1816495. We thank the anonymous reviewers for their constructive comments. We would also like to thank Sailik Sengupta from Arizona State University for kindly providing the MIQP code used in BSG with the piecewise linear McCormick envelopes.
REFERENCES

[1] Bo An, Milind Tambe, and Arunesh Sinha. 2016. Stackelberg security games (SSG): Basics and application overview. Improving Homeland Security Decisions (2016).
[2] John Bather. 1973. Optimal decision procedures for finite Markov chains. Part II: Communicating systems. Advances in Applied Probability 5, 3 (1973), 521-540.
[3] Dimitri P. Bertsekas. 2012. Dynamic Programming and Optimal Control (4th edition). Vol. 1. Athena Scientific, Belmont, MA.
[4] Michele Breton, Abderrahmane Alj, and Alain Haurie. 1988. Sequential Stackelberg equilibria in two-person games. Journal of Optimization Theory and Applications 59, 1 (1988), 71-97.
[5] Ankur Chowdhary, Adel Alshamrani, Dijiang Huang, and Hongbin Liang. 2018. MTD analysis and evaluation framework in software defined network (MASON). In Proceedings of the 2018 ACM International Workshop on Security in Software Defined Networks & Network Function Virtualization. ACM, 43-48.
[6] David Evans, Anh Nguyen-Tuong, and John Knight. 2011. Effectiveness of moving target defenses. In Moving Target Defense. Springer, 29-48.
[7] Xiaotao Feng, Zizhan Zheng, Prasant Mohapatra, and Derya Cansever. 2017. A Stackelberg game and Markov modeling of moving target defense. In International Conference on Decision and Game Theory for Security. Springer, 315-335.
[8] Jin B. Hong and Dong Seong Kim. 2015. Assessing the effectiveness of moving target defenses using security models. IEEE Transactions on Dependable and Secure Computing 13, 2 (2015), 163-177.
[9] Sushil Jajodia, Anup K. Ghosh, Vipin Swarup, Cliff Wang, and X. Sean Wang. 2011. Moving Target Defense: Creating Asymmetric Uncertainty for Cyber Threats. Vol. 54. Springer Science & Business Media.
[10] Sushil Jajodia, Noseong Park, Edoardo Serra, and V. S. Subrahmanian. 2018. SHARE: A Stackelberg honey-based adversarial reasoning engine. ACM Transactions on Internet Technology (TOIT) 18, 3 (2018), 30.
[11] Quan Jia, Huangxin Wang, Dan Fleck, Fei Li, Angelos Stavrou, and Walter Powell. 2014. Catch me if you can: A cloud-enabled DDoS defense. In . IEEE, 264-275.
[12] Liu Jianyong and Zhao Xiaobo. 2004. On average reward semi-Markov decision processes with a general multichain structure. Mathematics of Operations Research 29, 2 (2004), 339-352.
[13] Debarun Kar, Fei Fang, Francesco Delle Fave, Nicole Sintov, and Milind Tambe. 2015. A game of thrones: When human behavior models compete in repeated Stackelberg security games. In Proceedings of AAMAS. IFAAMAS, 1381-1390.
[14] Dmytro Korzhyk, Zhengyu Yin, Christopher Kiekintveld, Vincent Conitzer, and Milind Tambe. 2011. Stackelberg vs. Nash in security games: An extended investigation of interchangeability, equivalence, and uniqueness. Journal of Artificial Intelligence Research 41 (2011), 297-327.
[15] Henger Li and Zizhan Zheng. 2019. Optimal timing of moving target defense: A Stackelberg game model. In Proceedings of the 2019 IEEE Military Communications Conference (MILCOM). IEEE.
[16] Pratyusa K. Manadhata. 2013. Game theoretic approaches to attack surface shifting. In Moving Target Defense II. Springer, 1-13.
[17] Peter Mell, Karen Scarfone, and Sasha Romanosky. 2006. Common vulnerability scoring system. IEEE Security & Privacy 4, 6 (2006), 85-89.
[18] Peter Mell, Karen Scarfone, and Sasha Romanosky. 2007. A complete guide to the common vulnerability scoring system version 2.0. In Published by FIRST-Forum of Incident Response and Security Teams, Vol. 1. 23.
[19] Thanh Hong Nguyen, Debarun Kar, Matthew Brown, Arunesh Sinha, Albert Xin Jiang, and Milind Tambe. 2016. Towards a science of security games. In Mathematical Sciences with Multidisciplinary Applications. Springer, 347-381.
[20] Hamed Okhravi, William W. Streilein, and Kevin S. Bauer. 2015. Moving Target Techniques: Leveraging Uncertainty for Cyber Defense. Technical Report. MIT Lincoln Laboratory, Lexington, United States.
[21] Patrick D. O'Reilly. 2009. National vulnerability database (NVD). (2009).
[22] Praveen Paruchuri, Jonathan P. Pearce, Janusz Marecki, Milind Tambe, Fernando Ordonez, and Sarit Kraus. 2008. Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In Proceedings of AAMAS. IFAAMAS, 895-902.
[23] Wei Peng, Feng Li, Chin-Tser Huang, and Xukai Zou. 2014. A moving-target defense strategy for cloud-based services with heterogeneous and dynamic attack surfaces. In . IEEE, 804-809.
[24] Martin L. Puterman. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
[25] Paul J. Schweitzer. 1971. Iterative solution of the functional equations of undiscounted Markov renewal programming. J. Math. Anal. Appl. 34, 3 (1971), 495-501.
[26] Sailik Sengupta, Ankur Chowdhary, Dijiang Huang, and Subbarao Kambhampati. 2018. Moving target defense for the placement of intrusion detection systems in the cloud. In International Conference on Decision and Game Theory for Security. Springer, 326-345.
[27] Sailik Sengupta, Ankur Chowdhary, Abdulhakim Sabur, Dijiang Huang, Adel Alshamrani, and Subbarao Kambhampati. 2019. A survey of moving target defenses for network security. arXiv preprint arXiv:1905.00964 (2019).
[28] Sailik Sengupta, Satya Gautam Vadlamudi, Subbarao Kambhampati, Adam Doupé, Ziming Zhao, Marthony Taguinod, and Gail-Joon Ahn. 2017. A game theoretic approach to strategy generation for moving target defense in web applications. In Proceedings of AAMAS. IFAAMAS, 178-186.
[29] Arunesh Sinha, Fei Fang, Bo An, Christopher Kiekintveld, and Milind Tambe. 2018. Stackelberg security games: Looking beyond a decade of success. In IJCAI. 5494-5501.
[30] Aditya K. Sood and Richard J. Enbody. 2012. Targeted cyberattacks: A superset of advanced persistent threats. IEEE Security & Privacy 11, 1 (2012), 54-61.
[31] Marthony Taguinod, Adam Doupé, Ziming Zhao, and Gail-Joon Ahn. 2015. Toward a moving target defense for web applications. In . IEEE, 510-517.
[32] Satya Gautam Vadlamudi, Sailik Sengupta, Marthony Taguinod, Ziming Zhao, Adam Doupé, Gail-Joon Ahn, and Subbarao Kambhampati. 2016. Moving target defense for web applications using Bayesian Stackelberg games. In Proceedings of AAMAS. IFAAMAS, 1377-1378.
[33] Shardul Vikram, Chao Yang, and Guofei Gu. 2013. Nomad: Towards non-intrusive moving-target defense against web bots. In . IEEE, 55-63.
[34] Heinrich Von Stackelberg. 2010. Market Structure and Equilibrium. Springer Science & Business Media.
[35] Mengmeng Yu and Seung Ho Hong. 2015. A real-time demand-response algorithm for smart grids: A Stackelberg game approach. IEEE Transactions on Smart Grid 7, 2 (2015), 879-888.
[36] Jin Zhang and Qian Zhang. 2009. Stackelberg game for utility-based cooperative cognitive radio networks. In Proceedings of the Tenth ACM International Symposium on Mobile Ad Hoc Networking and Computing. ACM, 23-32.
[37] Quanyan Zhu and Tamer Başar. 2013. Game-theoretic approach to feedback-driven multi-stage moving target defense. In International Conference on Decision and Game Theory for Security. Springer, 246-263.
[38] Rui Zhuang. 2015. A Theory for Understanding and Quantifying Moving Target Defense. Ph.D. Dissertation. Kansas State University.
[39] Rui Zhuang, Scott A. DeLoach, and Xinming Ou. 2014. Towards a theory of moving target defense. In Proceedings of the First ACM Workshop on Moving Target Defense. ACM, 31-40.
[40] Rui Zhuang, Su Zhang, Scott A. DeLoach, Xinming Ou, and Anoop Singhal. 2012. Simulation-based approaches to studying effectiveness of moving-target network defense. In