Finding Effective Security Strategies through Reinforcement Learning and Self-Play
Kim Hammar†‡ and Rolf Stadler†‡
†Division of Network and Systems Engineering, KTH Royal Institute of Technology, Sweden
‡KTH Center for Cyber Defense and Information Security, Sweden
Email: {kimham, stadler}@kth.se
October 6, 2020

Abstract—We present a method to automatically find security strategies for the use case of intrusion prevention. Following this method, we model the interaction between an attacker and a defender as a Markov game and let attack and defense strategies evolve through reinforcement learning and self-play without human intervention. Using a simple infrastructure configuration, we demonstrate that effective security strategies can emerge from self-play. This shows that self-play, which has been applied in other domains with great success, can be effective in the context of network security. Inspection of the converged policies shows that the emerged policies reflect common-sense knowledge and are similar to strategies of humans. Moreover, we address known challenges of reinforcement learning in this domain and present an approach that uses function approximation, an opponent pool, and an autoregressive policy representation. Through evaluations we show that our method is superior to two baseline methods but that policy convergence in self-play remains a challenge.
Index Terms—Network Security, Reinforcement Learning, Markov Security Games
I. INTRODUCTION
An organization's security strategy has traditionally been defined, implemented, and updated by domain experts. Although this approach can provide basic security for an organization's communication and computing infrastructure, a growing concern is that update cycles become shorter and security requirements change faster [1]. Two factors contribute to this development: on the one hand, infrastructures keep changing due to functional upgrades and innovation, and, on the other hand, attacks increase in sophistication. As a response, significant efforts are made to automate security processes and functions, including automated threat modeling [2], [3], game-theoretic approaches [4], [5], evolutionary methods [6], [7], and learning-based solutions [8], [9].

To automate the response to a cyber attack in a changing environment, we view the interaction between an attacker and a defender as a game, and model the evolution of attack and defense strategies with automatic self-play. To avoid the need for domain knowledge to be coded into agent strategies, we study methods where the strategies evolve without human intervention from random configurations. In particular, we use self-play, a setting where both the attacker and the defender learn by interacting with each other, as a technique to automatically harden the defense.

Formally, we model the interaction between an attacker and a defender as a Markov game. We then use simulations of self-play where autonomous agents interact and continuously update their strategies based on experience from previously played games. To automatically update strategies in the game, several methods can be used, including computational game theory, evolutionary algorithms, and reinforcement learning. In this work, we choose reinforcement learning, which allows us to use simulations of self-play where both agents evolve their strategies simultaneously.

In pursuing this approach, we face several challenges. First, from a game-theoretic perspective, the game between an attacker and a defender is one with partial observability. That is, a defender can typically not observe the state and location of an attack, and an attacker is generally not aware of the full state of a defender's infrastructure. Second, due to the scale and complexity of computer infrastructures, the games that we consider have large state and action spaces, which make them hard to analyze theoretically. Third, as both the attacker and the defender learn simultaneously during self-play, the reinforcement learning environment becomes non-stationary [10], which is a known challenge for reinforcement learning and often results in agent strategies that do not converge [11], [12], [13].

We develop our method around the use case of intrusion prevention. This use case considers a defender that owns an infrastructure that consists of a set of connected components (e.g. a communication network), and an attacker that seeks to intrude on the infrastructure. The defender's objective is to protect the infrastructure from intrusions by monitoring the network and patching vulnerabilities. Conversely, the attacker's goal is to compromise the infrastructure and gain access to a critical component. To achieve this, the attacker must explore the infrastructure through reconnaissance and attack components along the path to the critical component.
In this work, we study a relatively small and simple infrastructure of the defender, which allows us to more easily compare security strategies and focus on the principles of our method.

With this paper, we make the following contributions. First, we show, using a small infrastructure, that reinforcement learning and self-play can be used to automatically discover effective strategies for intrusion prevention. Although self-play has been used successfully in other domains, such as board games [12] and video games [14], [13], our work is, to the best of our knowledge, the first study of self-play in the context of network and systems security. Second, we present a reinforcement learning method to address the challenges that appear in the context of our use case and show through evaluations that it outperforms two baselines. Specifically, to tackle the problem of the high-dimensional state space, we use function approximation implemented with neural networks. To address the non-stationary reinforcement learning environment, we use an opponent-pool technique, and to deal with the large action space, we propose an autoregressive policy representation.

The remainder of this paper is structured as follows. We first cover the necessary background information on partially observed Markov decision processes, reinforcement learning, and Markov games. Then we introduce a game model for intrusion prevention and present our method together with experimental results. Lastly, we describe how our work relates to prior research, present our conclusions, and provide directions for future work.

II. POMDPS, REINFORCEMENT LEARNING AND MARKOV GAMES
This section contains background information on Markov decision processes, reinforcement learning, and Markov games.
A. Markov Decision Processes
A Markov Decision Process (MDP) is defined by a six-tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P^a_{ss'}, R^a_{ss'}, \gamma, \rho \rangle$ [15]. $\mathcal{S}$ denotes the set of states and $\mathcal{A}$ denotes the set of actions. $P^a_{ss'}$ refers to the transition probability of moving from state $s$ to state $s'$ when taking action $a$ (Eq. 1), which has the Markov property (Eq. 2). Similarly, $R^a_{ss'} \in \mathbb{R}$ is the expected reward when taking action $a$ in state $s$ to transition to state $s'$ (Eq. 3). Finally, $\gamma \in (0, 1]$ is the discount factor and $\rho : \mathcal{S} \mapsto [0, 1]$ is the initial state distribution.

$$P^a_{ss'} = \mathbb{P}[s_{t+1} = s' \mid s_t = s, a_t = a] \quad (1)$$
$$\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \ldots, s_t] \quad (2)$$
$$R^a_{ss'} = \mathbb{E}[r_{t+1} \mid a_t = a, s_t = s, s_{t+1} = s'] \quad (3)$$

A Partially Observed Markov Decision Process (POMDP) is an extension of an MDP [16], [17]. The difference compared with an MDP is that, in a POMDP, the agent does not directly observe the states $s \in \mathcal{S}$. Instead, the agent observes observations $o \in \mathcal{O}$, which depend on the state $s$ as defined by an observation function $\mathcal{Z}$. Specifically, a POMDP is defined by an eight-tuple $\mathcal{M}_P = \langle \mathcal{S}, \mathcal{A}, P^a_{ss'}, R^a_{ss'}, \gamma, \rho, \mathcal{O}, \mathcal{Z} \rangle$. The first six elements define an MDP. $\mathcal{O}$ denotes the set of observations and $\mathcal{Z} = \mathbb{P}[o \mid s]$ denotes the observation function, where $o \in \mathcal{O}$ and $s \in \mathcal{S}$.
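To make the notation concrete, the two tuples can be written down directly in code. The following minimal Python sketch is our own illustration (the field names are hypothetical and not part of any published implementation); it stores the transition probabilities, rewards, and observation function as dictionaries:

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State, Action, Obs = int, int, int

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    P: Dict[Tuple[State, Action, State], float]   # P^a_{ss'}
    R: Dict[Tuple[State, Action, State], float]   # R^a_{ss'}
    gamma: float                                  # discount factor
    rho: Dict[State, float]                       # initial state distribution

@dataclass
class POMDP(MDP):
    # the first six elements (inherited) define an MDP
    observations: List[Obs] = field(default_factory=list)
    Z: Dict[Tuple[Obs, State], float] = field(default_factory=dict)  # Z = P[o | s]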
B. The Reinforcement Learning Problem

The reinforcement learning problem can be formulated as that of finding a policy $\pi^*$ that maximizes the expected cumulative reward of an MDP over a finite horizon $T$ [18]:

$$\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_{t+1}\right] \quad (4)$$

Fig. 1: Modeling intrusion prevention as a Markov game (left: graph model; right: network model, with nodes $N_{start}$ and $N_{data}$).

The Bellman equations [19] (Eq. 5-6) relate the optimal policy $\pi^*$ to the states and the actions of the MDP, where $\pi^*(s) = \arg\max_a Q^*(s, a)$.

$$V^*(s) = \max_a \mathbb{E}[r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a] \quad (5)$$
$$Q^*(s, a) = \mathbb{E}[r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a] \quad (6)$$

A reinforcement learning algorithm is a procedure to compute or approximate $\pi^*$. Three main classes of such algorithms exist: value-based algorithms (e.g. Q-learning [20]), policy-based algorithms (e.g. PPO [21]), and model-based algorithms (e.g. Dyna-Q [18]).
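To illustrate how Eq. 6 gives rise to a value-based algorithm, the sketch below implements tabular Q-learning, whose update rule is a stochastic approximation of the Bellman optimality equation. It is a generic illustration under the assumption of an environment step function env_step(s, a) -> (s', r, done); it is not the algorithm used in this paper, which relies on policy-based methods (Section IV):

import random
from collections import defaultdict

def q_learning(env_step, states, actions, episodes=1000,
               alpha=0.1, gamma=0.99, eps=0.1, horizon=100):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)  # sample a start state (uniform here; in general from rho)
        for _ in range(horizon):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env_step(s, a)
            # temporal-difference update toward the Bellman optimality target (Eq. 6)
            target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
            if done:
                break
    # greedy policy pi*(s) = argmax_a Q*(s, a)
    return {s: max(actions, key=lambda a_: Q[(s, a_)]) for s in states}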
C. Markov Games

Markov games, also known as stochastic games [22], provide a theoretical framework for multi-agent reinforcement learning [23]. A Markov game of $N$ agents is defined as a tuple $\mathcal{M}_G = \langle \mathcal{S}, \mathcal{A}_1, \ldots, \mathcal{A}_N, \mathcal{T}, \mathcal{R}_1, \ldots, \mathcal{R}_N, \gamma, \rho \rangle$. $\mathcal{S}$ denotes the set of states and the list $\mathcal{A}_1, \ldots, \mathcal{A}_N$ denotes the sets of actions of agents $1, \ldots, N$. The transition function $\mathcal{T}$ is an operator on the set of states and the combined action space of all agents, $\mathcal{T} : \mathcal{S} \times \mathcal{A}_1 \times \ldots \times \mathcal{A}_N \mapsto \mathcal{S}$. Similarly, the functions $\mathcal{R}_1, \ldots, \mathcal{R}_N$ define the rewards for agents $1, \ldots, N$, where $\mathcal{R}_i : \mathcal{S} \times \mathcal{A}_i \mapsto \mathbb{R}$. Moreover, $\gamma$ is the discount factor and $\rho : \mathcal{S} \mapsto [0, 1]$ is the initial state distribution. Finally, in the partially observed setting of a Markov game, the model also includes sets of observations $\mathcal{O}_1, \ldots, \mathcal{O}_N$ as well as observation functions $\mathcal{Z}_1, \ldots, \mathcal{Z}_N$ [24].
III. MODELING INTRUSION PREVENTION AS A MARKOV GAME

In this section, we model the use case of intrusion prevention as a zero-sum Markov game that involves two agents, an attacker and a defender.
A. Intrusion Prevention as a Game
The right side of Fig. 1 shows the infrastructure underlying our use case. It is depicted as a graph that includes four network components. The component $N_{start}$ represents the attacker's computer and the remaining components represent the defender's infrastructure, where $N_{data}$ is the component that the attacker wants to compromise.

Fig. 2: Three scenarios of the intrusion prevention game, defined by the nodes' initial attack states $S^A$ and defense states $S^D$. Scenarios 1 and 2 model infrastructures with strong defenses (first four defense attributes) but weak detection capabilities (fifth defense attribute). In scenario 1, each node contains one vulnerability (a defense attribute with a low value), whereas in scenario 2 only one of the intermediary nodes has a vulnerability (the left one). Scenario 3 models an infrastructure with both weak defenses and weak detection capabilities.

To compromise the target component $N_{data}$, the attacker must explore the infrastructure through reconnaissance and compromise components on the path to $N_{data}$. At the same time, the defender monitors the network and increases its defenses to prevent the attacker from reaching $N_{data}$. In this adversarial process, both the attacker and the defender have a partial view of the network. At the beginning of the game, the attacker knows neither the topology of the infrastructure nor its vulnerabilities. In contrast, the defender has complete knowledge of the network's topology and the vulnerabilities of its components, but cannot observe the status of attacks.

The described adversarial process evolves as a round-based game. In each round, the attacker and the defender perform actions on components in the network, continuing until either the attacker wins the game by compromising $N_{data}$, or the defender wins the game by detecting the attacker.
B. A Formal Game Model of Intrusion Prevention

We extend prior work by others [25] and model the network infrastructure as a graph $\mathcal{G} = \langle \mathcal{N}, \mathcal{E} \rangle$ (see Fig. 1, left side). The nodes $\mathcal{N}$ of the graph represent the components of the infrastructure. Similarly, the edges $\mathcal{E}$ of the graph represent connectivity between components. The graph has two special nodes, $N_{start}$ and $N_{data}$, representing the attacker's starting position and target, respectively. Moreover, each node $N_k \in \mathcal{N}$ has an associated node state, $S_k = \langle S^A_k, S^D_k \rangle$. The node's defense status $S^D_k$ is only visible to the defender and consists of $m+1$ attributes, $S^D_{k,1}, \ldots, S^D_{k,m+1}$. Likewise, the node's attack status $S^A_k$ is only visible to the attacker and consists of $m$ attributes, $S^A_{k,1}, \ldots, S^A_{k,m}$. The attribute values are natural numbers in the range $0, \ldots, w$, where $w > 0$.

The values of the attack attributes $S^A_{k,1}, \ldots, S^A_{k,m}$ represent the strength of $m$ different types of attacks against node $N_k$, e.g. denial of service attacks, cross-site scripting attacks, etc. Similarly, the values of the defense attributes $S^D_{k,1}, \ldots, S^D_{k,m}$ represent the strength of the node's defenses against the $m$ attack types and encode the node's security mechanisms, such as firewalls and encryption functions. Additionally, each node $N_k$ has a special defense attribute $S^D_{k,m+1}$, whose value represents the node's capability to detect an attack.

As an example, installing an intrusion detection system (a monitoring mechanism) at node $N_k$ would correspond to an increase of the detection value $S^D_{k,m+1}$ in the model. Correspondingly, installing a rate-limiting mechanism (a security mechanism against denial of service attacks) at node $N_k$ would correspond to an increase of the defense attribute $S^D_{k,j}$, where $j$ is the attack type for denial of service attacks in the model.

The attacker can perform two types of actions on a visible node $N_k$: a reconnaissance action, which renders the defense status $S^D_k$ visible to the attacker, or an attack of type $j \in 1, \ldots, m$, which increases the attack value $S^A_{k,j}$ by 1. If the attack value exceeds the node's defense value, i.e. $S^A_{k,j} > S^D_{k,j}$, the attacker has compromised the node and the node's neighbors become visible to the attacker. Conversely, if the attacker performs an attack that does not compromise a node, the attack is detected by the defender with probability $p = \frac{S^D_{k,m+1}}{w+1}$, defined by the node's detection capability $S^D_{k,m+1}$.

Just like the attacker, the defender can perform two types of actions on node $N_k$: either a monitoring action, which improves the detection capability of the node and increments the detection value $S^D_{k,m+1}$, or a defensive action, which improves the defense against attacks of type $j \in 1, \ldots, m$ and increments the defense value $S^D_{k,j}$.

The actions are performed on a round-by-round basis in the game. In each round, the attacker and the defender perform one action each, which brings the system into a new state. The game ends either when the attacker compromises the target node $N_{data}$ (attacker wins), or when the attacker is detected (defender wins). The winner of a game is rewarded with a utility of $+1$ whereas the opponent receives a utility of $-1$. The game evolves as a stochastic game with the Markov property, $\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \ldots, s_t]$, which follows trivially from the game dynamics defined above. Since each node has $m$ attack attributes and $m+1$ defense attributes, each with $w+1$ possible values, the size of the state space $\mathcal{S}$ of the Markov game is $(w+1)^{|\mathcal{N}|(2m+1)}$. Finally, the size of the action spaces $\mathcal{A}_1$ and $\mathcal{A}_2$ for both the attacker and the defender is $|\mathcal{N}| \cdot (m+1)$.
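The game dynamics described above can be summarized in a few lines of code. The following sketch is our own illustration (the dictionary node representation is hypothetical, not the paper's implementation); it encodes the compromise rule $S^A_{k,j} > S^D_{k,j}$, the detection probability $p = S^D_{k,m+1}/(w+1)$, and the two defender actions:

import random

def attacker_attack(node, j, w=9):
    """One attack of type j on a node with keys 'attack', 'defense', 'detect'.
    Returns (compromised, detected), following the rules above."""
    node["attack"][j] += 1                       # attack value S^A_{k,j} incremented by 1
    if node["attack"][j] > node["defense"][j]:   # S^A_{k,j} > S^D_{k,j}: node compromised
        return True, False
    p_detect = node["detect"] / (w + 1)          # p = S^D_{k,m+1} / (w + 1)
    return False, random.random() < p_detect

def defender_defend(node, j=None):
    """Defender either monitors (raises the detection attribute) or hardens type j."""
    if j is None:
        node["detect"] += 1                      # monitoring action
    else:
        node["defense"][j] += 1                  # defensive action of type j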
C. Scenarios for the Intrusion Prevention Game
The Markov game defined above is parameterized by the graph's topology, the number of attack types $m$, the maximal attribute value $w$, and the initial node states $S_1, \ldots, S_{|\mathcal{N}|}$. In the remainder of the paper we focus on three specific scenarios that are depicted in Fig. 2. We define the number of attack types to be four ($m = 4$), the maximal attribute value $w = 9$, and initialize all attack states to zero: $S^A_{k,i} = 0$ for $k \in 1, \ldots, |\mathcal{N}|$ and $i \in 1, \ldots, m$. The three scenarios differ in how their defense states $S^D_{k,i}$ for $k \in 1, \ldots, |\mathcal{N}|$ and $i \in 1, \ldots, m+1$ are initialized, as is illustrated in Fig. 2. Specifically, scenarios 1 and 2 model infrastructures with strong defenses but weak detection capabilities, whereas scenario 3 models an infrastructure with both weak defenses and weak detection capabilities.
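For illustration, a scenario can be initialized as follows. The sketch uses the parameters $m = 4$ and $w = 9$ from above; the concrete defense vectors are hypothetical placeholders in the spirit of scenario 3, not the exact values shown in Fig. 2:

m, w = 4, 9                      # four attack types, maximal attribute value 9
num_nodes = 4                    # N_start, two intermediary nodes, N_data

def make_node(defense, detect):
    return {"attack": [0] * m,          # all attack states initialized to zero
            "defense": list(defense),   # m defense attributes S^D_{k,1..m}
            "detect": detect}           # detection attribute S^D_{k,m+1}

# Hypothetical initialization in the spirit of scenario 3 (weak defenses and
# weak detection); the exact per-node vectors in the paper's figures differ.
scenario_3 = [make_node([1, 1, 1, 1], 1) for _ in range(num_nodes)]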
IV. FINDING SECURITY STRATEGIES THROUGH REINFORCEMENT LEARNING

This section describes our method for finding strategies for the attacker and the defender in the game defined in Section III. Specifically, we describe how self-play in combination with reinforcement learning can be used to learn strategies without prior knowledge. We also discuss our approach to address the difficulties associated with the high-dimensional state and action spaces, as well as the challenges that come with a non-stationary environment.
A. Learning Policies through Self-Play
To simulate games using the model defined in Section III, actions for the attacker and the defender are sampled from policies. (As we build on prior work in both reinforcement learning and game theory, we use the terms "strategy" and "policy" interchangeably.) In our approach, these policies are not defined by experts, but evolve through the process of self-play, which we implement as follows. First, we initialize both the attacker and the defender with random policies. Next, we run a series of simulations where the agents play against each other using their current policies. After a given number of games, we use the game outcomes and the game trajectories to update the policies using reinforcement learning. This process of playing games and updating the agents' policies continues until both policies sufficiently converge. Although self-play often converges in practice, as reported in [26], [12], [13], no formal guarantees for policy convergence in self-play have been established.
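The self-play procedure described above can be summarized by the following skeleton (our own sketch; env.play_game and the agents' update methods are assumed interfaces, not part of the released implementation):

def self_play(env, attacker, defender, iterations=1000, games_per_iter=100):
    """Skeleton of the self-play process: play games, then update both policies."""
    for _ in range(iterations):
        trajectories = []
        for _ in range(games_per_iter):
            # both agents act according to their current policies
            trajectories.append(env.play_game(attacker.policy, defender.policy))
        # update both policies from the outcomes with reinforcement learning
        attacker.update(trajectories)
        defender.update(trajectories)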
The Challenge of a Non-Stationary Environment: One reason for the lack of convergence guarantees of self-play is that the environment is non-stationary, which is an inherent challenge for reinforcement learning [10], [11]. Specifically, when multiple agents learn simultaneously in self-play, the agents' policies are part of the environment. This means that for each agent, the environment changes when the opponent updates its policy. Hence, the environment is non-stationary. To improve the possibility of convergence in self-play, despite the non-stationary environment, different methods have been proposed in prior work [12], [13]. We describe our approach to this issue in Section IV-E.
B. Reinforcement Learning Algorithm
To learn strategies for the attacker and the defender in the game, we use the well-known reinforcement learning algorithms REINFORCE [27] and PPO [21]. Both of them implement the policy gradient method for solving the reinforcement learning problem (Eq. 4). This method can be summarized as follows. First, the policy $\pi$ is represented as a parameterized function $\pi_\theta$, where $\theta \in \mathbb{R}^d$. Second, an objective function $J(\theta)$ that estimates the expected reward for the policy $\pi_\theta$ is introduced. For example, the expected reward following policy $\pi_\theta$ can be estimated by sampling from the environment, e.g. $J(\theta) = \mathbb{E}_{o \sim \rho^{\pi_\theta}, a \sim \pi_\theta}[R]$, where $\rho^{\pi_\theta}$ is the observation distribution and $R$ is the cumulative episode reward. Third, the optimization problem of maximizing $J(\theta)$ is solved using stochastic gradient ascent with a variant of the following gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{o \sim \rho^{\pi_\theta}, a \sim \pi_\theta}\Big[\underbrace{\nabla_\theta \log \pi_\theta(a \mid o)}_{\text{actor}}\ \underbrace{A^{\pi_\theta}(o, a)}_{\text{critic}}\Big] \quad (7)$$

where $A^{\pi_\theta}(o, a) = Q^{\pi_\theta}(o, a) - V^{\pi_\theta}(o)$ is the advantage function, which gives an estimate of how advantageous the action $a$ is when following policy $\pi_\theta$ compared to the average action.
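As an illustration of the policy gradient method, the sketch below performs one REINFORCE update in PyTorch, using the discounted episode return in place of the advantage estimate in Eq. 7. It is a hypothetical minimal example, not the paper's training code; policy_net is assumed to map an observation tensor to action logits:

import torch

def reinforce_update(policy_net, optimizer, trajectory, gamma=0.99):
    """One REINFORCE update: ascend E[grad log pi(a|o) * G] (cf. Eq. 7)."""
    obs, actions, rewards = trajectory   # lists: tensors, ints, floats
    returns, G = [], 0.0
    for r in reversed(rewards):          # discounted return-to-go
        G = r + gamma * G
        returns.insert(0, G)
    loss = torch.tensor(0.0)
    for o, a, G_t in zip(obs, actions, returns):
        log_prob = torch.log_softmax(policy_net(o), dim=-1)[a]
        loss = loss - log_prob * G_t     # negated for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()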
C. Function Approximation

The Markov game described in Section III has a high-dimensional state space where the number of states grows exponentially with the number of nodes $|\mathcal{N}|$ and attack types $m$. Specifically, as each node $N_k \in \mathcal{N}$ has $m$ attack attributes and $m+1$ defense attributes, whose values range from $0, \ldots, w$, the state space has size $|\mathcal{S}| = (w+1)^{|\mathcal{N}|(2m+1)}$. This means that tabular reinforcement learning methods that rely on enumerating the entire state space are impractical. Instead, we use a parameterized function $\pi_\theta$ as a more compact representation of the policy than a table. In particular, we implement $\pi_\theta$ with a deep neural network that has a fixed set of parameters $\theta \in \mathbb{R}^d$. The neural network takes as input an observation $o$, and outputs a conditional probability distribution $\pi_\theta(a \mid o)$ over all possible actions $a$. Moreover, the neural network uses an actor-critic architecture [28] and computes a second output that estimates the value function $V^{\pi_\theta}(o)$, which is used to estimate $A^{\pi_\theta}(o, a)$ in the gradient in Eq. 7.
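A minimal actor-critic network of the kind described above could look as follows in PyTorch (our own sketch; the layer sizes are hypothetical, and num_actions corresponds to the $|\mathcal{N}| \cdot (m+1)$ actions of the flat action space):

import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with a policy head pi_theta(a|o) and a value head V(o)."""
    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, num_actions)  # logits of pi_theta(a|o)
        self.value_head = nn.Linear(hidden, 1)             # estimate of V^pi(o)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h)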
D. Autoregressive Policy Representation

Using reinforcement learning in environments with large action spaces is challenging and often results in slow convergence of policies [14]. In our case, the action space of the Markov game from Section III grows linearly with the number of nodes $|\mathcal{N}|$ and the number of attack types $m$. To deal with this challenge, we leverage the structure of the problem domain to reduce the size of the action space. Specifically, we represent an action as a sequence of two sub-actions: (1) select the node $n$ to attack or defend; and (2) select the type of attack or defense $a$. As a consequence, we can decompose the policy $\pi(a, n \mid o)$ into two sub-policies, $\pi(a \mid n, o)$ and $\pi(n \mid o)$, based on the chain rule of probability (Eq. 8). This reduces the size of the action space of both the attacker and the defender from $|\mathcal{N}| \cdot (m+1)$ to $|\mathcal{N}| + (m+1)$.

$$\pi(a, n \mid o) = \pi(a \mid n, o) \cdot \pi(n \mid o) \quad (8)$$

We implement each sub-policy with a neural network. First, $\pi_\theta(n \mid o)$ takes an observation $o$ as input and decides the node $n$ to attack or defend. Next, the second neural network $\pi_\phi(a \mid n, o)$ focuses on the selected node $n$ to identify vulnerabilities that can be exploited or patched with an attack or defense of type $a$. Hence, the action is sampled in an autoregressive manner from the two sub-policies.

Fig. 3: Attacker win ratio against the number of training iterations; the top row shows the results from training the attacker against DEFENDMINIMAL; the bottom row shows the results from training the defender against ATTACKMAXIMAL; the three columns represent the three scenarios; the curve labeled PPO-AR shows the mean values of our proposed method; the results are averages over five training runs with different random seeds; the shaded regions show the standard deviation.
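Sampling an action autoregressively from the two sub-policies (Eq. 8) can be sketched as follows. This is our own illustration; conditioning the second network on the selected node via one-hot concatenation is one possible design, not necessarily the one used in the released implementation:

import torch
from torch.distributions import Categorical

def sample_action(node_net, type_net, obs):
    """First sample a node n ~ pi_theta(n|o), then a type a ~ pi_phi(a|n,o)."""
    node_logits = node_net(obs)                  # logits over the |N| nodes
    n = Categorical(logits=node_logits).sample()
    n_onehot = torch.nn.functional.one_hot(n, node_logits.shape[-1]).float()
    type_logits = type_net(torch.cat([obs, n_onehot]))  # logits over the m+1 types
    a = Categorical(logits=type_logits).sample()
    return n.item(), a.item()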
E. Opponent Pool

Although prior works have demonstrated that self-play can be a powerful learning method [12], [13], it is also well-known that self-play can suffer from training instability, which prevents the agents' policies from converging [29], [12], [13]. Specifically, it can occur that an agent adjusts its policy to its current opponent to such a degree that the policy overfits and becomes ineffective against other policies. Our approach to mitigate this problem is to sample opponent policies from a pool of policies during training [12], [13]. The opponent pool increases the diversity of policies and, as a result, reduces the chance of overfitting. To populate the pool, we periodically add the agents' current attack and defense policies to the pool during training.
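A minimal sketch of the opponent-pool technique is given below (our own illustration; the maximum pool size of 100000 follows Table 1, while the sampling probability is a free parameter):

import random

class OpponentPool:
    """Pool of past opponent policies, sampled during training to reduce overfitting."""
    def __init__(self, max_size=100000):
        self.pool, self.max_size = [], max_size

    def add(self, policy_snapshot):
        if len(self.pool) >= self.max_size:
            self.pool.pop(0)                  # drop the oldest snapshot
        self.pool.append(policy_snapshot)

    def sample(self, current_policy, p_pool=0.5):
        # with probability p_pool play against a random past policy,
        # otherwise against the opponent's current policy
        if self.pool and random.random() < p_pool:
            return random.choice(self.pool)
        return current_policy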
V. LEARNING EFFECTIVE SECURITY STRATEGIES
In this section, we evaluate the method presented in Section IV for finding security strategies for the use case of intrusion prevention. We train reinforcement learning agents using simulations of self-play and evaluate our method with respect to convergence of agent policies. We also compare its performance with those of baseline reinforcement learning algorithms and inspect the learned policies. In sub-section V-A we train the attacker and the defender against static opponents, and in sub-section V-B we train both the attacker and the defender simultaneously.

Our implementation of the game model and the agents' policies is open source and available at https://github.com/Limmen/gym-idsgame. The simulations have been conducted using a Tesla P100 GPU and hyperparameters are listed in Appendix A.
A. Learning Policies Against a Static Opponent
We first examine whether our method can discover effective attack and defense policies against a static opponent policy. This means that we keep one policy fixed in self-play and that the other agent learns its policy against a static opponent. As one agent's policy is fixed, the environment for the other agent is stationary, which simplifies policy convergence.
1) Static Opponent Policies:
We investigate two cases of learning against a static opponent. In the first case, the attacker learns its policy by playing against a static defender policy called DEFENDMINIMAL. In the second case, the defender learns its policy by playing against a static attacker policy called ATTACKMAXIMAL. DEFENDMINIMAL is a heuristic policy that updates the attribute with the minimal defense value across all nodes. Similarly, ATTACKMAXIMAL is a heuristic policy that updates the attribute with the maximal attack value across all nodes that are visible to the attacker. If several minimal or maximal values exist, DEFENDMINIMAL and ATTACKMAXIMAL pick the attribute to update at random.

Fig. 4: An illustration of a learned attack strategy, evolving from left to right. The attacker first scans a neighboring node for vulnerabilities (low defense attributes) (state $t_1$). The attacker then exploits the found vulnerability (state $t_2$), compromises the node, and scans the target node $N_{data}$ (state $t_3$). Finally, the attacker completes the intrusion by attacking $N_{data}$ (state $t_4$).

2) Simulation Setup: To evaluate our method, we run simulations of the three scenarios described in Section III-C. Before each simulation run, we initialize the model with a random permutation of the defense attributes to prevent overfitting (see Fig. 2). In addition to the three scenarios, we evaluate three algorithms: our proposed method, PPO-AR, which uses an autoregressive policy representation (see Section IV-D), and two baseline methods, regular PPO and REINFORCE. Apart from the different algorithms and policy representations, all three methods use the same training setup and hyperparameters (Appendix A). For each scenario and algorithm, we run a fixed number of training iterations, each iteration consisting of a fixed number of game rounds. After each training iteration, the non-static agent updates its policy. Moreover, to measure stability we run each simulation five times with different random seeds, where a single simulation run accounts for several hours of training time on a P100 GPU. The average results are shown in Fig. 3.

3) Analysis of the Results in Fig. 3: The top row of Fig. 3 shows the results from training the attacker against the static defender, and the bottom row shows the results from training the defender against the static attacker. The three columns represent the three scenarios. The x-axes denote the training iterations and the y-axes denote the attacker's empirical win ratio, calculated as the win ratio for the attacker over a fixed number of evaluation games.

The top row of Fig. 3 demonstrates that the attacker quickly adjusts its policy to increase its win ratio. Specifically, we can observe that the policies converge after a moderate number of iterations. Furthermore, we can see that our method, PPO-AR, achieves a higher win ratio than the two baselines in all three scenarios. It can also be seen that among the baselines, PPO outperforms REINFORCE.

Looking at the bottom row of Fig. 3, we observe, similar to the top row, that the defender's policy converges quickly, although the results tend to have a higher variance. Moreover, we see that our proposed method, PPO-AR, outperforms the baselines and achieves the lowest average win ratio of the attacker for all three scenarios. We also observe that PPO outperforms REINFORCE.

That PPO-AR outperforms the baselines indicates that the reduced action space that results from the autoregressive policy representation simplifies exploration, enabling the agents to discover more effective policies. Moreover, although we expected that PPO would do better than REINFORCE, as it uses a more theoretically justified policy update [21], it is surprising that the difference is so substantial in the results; we anticipate that it could be due to a wrong setting of hyperparameters.

In the three columns of Fig. 3 we see that the similarities and the differences among the three scenarios (see Section III-C) are visible in the results. Specifically, that the win ratio of the attacker is lower in scenarios 1 and 2 than in scenario 3 reflects the fact that scenarios 1 and 2 have strong initial defenses, whereas scenario 3 has weak initial defenses. Similarly, that scenario 1 and scenario 2 have alike defenses can be seen in the similar curves in the two leftmost columns of Fig. 3. Moreover, the weak initial defense of scenario 3 can be seen in the high variance of our method in that scenario (Fig. 3, bottom row, right column). As all nodes are equally vulnerable in scenario 3, the static attacker can select any attack path. Consequently, the defender learns a defense policy that predicts a diverse set of attacks, leading to a high variance.

4) Inspection of Learned Policies: We find that the learned attack policies are deterministic and generally consist of two steps: (1) perform reconnaissance to collect information about neighboring nodes; and (2) exploit identified vulnerabilities (Fig. 4). Furthermore, we find that the learned defense policies also are mostly deterministic and consist of hardening the critical node $N_{data}$ and patching identified vulnerabilities across all nodes. Finally, we also find that the defender learns to use the predictable attack pattern to its advantage and defends where the attacker is likely to attack next.
Fig. 5: Attacker win ratio against the number of training iterations; the three sub-graphs show the results from training the attacker and the defender in self-play for the three scenarios; the curve labeled PPO-AR shows the mean values of our proposed method; the results are averages over five training runs with different random seeds; the shaded regions show the standard deviation.
B. Learning Policies Against a Dynamic Opponent
In this subsection we study the setting where the attacker and the defender learn simultaneously. Our goal is to investigate whether this setting can lead to stable and effective policies.
1) Simulation Setup:
To evaluate our method, we use the same setup as in Section V-A and run the same number of training iterations, each with five different random seeds. After each iteration, both agents update their policies. Hence, this setup is different from Section V-A, where one agent is static and only one agent updates its policy after each iteration. That both agents update their policies means that, in each iteration, an agent plays against a different opponent policy, making the environment non-stationary. Consequently, the main challenge in this setting is to get the agents' policies to converge. To improve convergence, we use the opponent pool technique described in Section IV-E. The results are shown in Fig. 5.
2) Analysis of the Results in Fig. 5:
We observe in Fig. 5 that policies sometimes converge and sometimes oscillate across iterations. We also see that this behavior depends on the particular scenario. For instance, we find that scenarios 1 and 2, which represent infrastructures with good defenses (see Section III-C), exhibit stable behavior of the policies. In contrast, scenario 3, which captures an infrastructure with weak defenses, exhibits oscillation of the policies.

Looking at our proposed method, PPO-AR, we can see that it stabilizes in scenarios 1 and 2 after a sufficient number of iterations. However, in scenario 3 our method oscillates with increasing amplitude. The oscillation indicates that the agent reacts to the changed policy of its opponent and overfits. Similar behavior of self-play has been observed in related works [30], [13]. Although the opponent pool technique has been introduced to mitigate the oscillation, it is clear from the results that this is insufficient in some scenarios.
3) Inspection of Learned Policies:
When inspecting the learned policies, we find that they are fully stochastic. This contrasts with the policies inspected in Section V-A, which we found to be deterministic. By fully stochastic we mean that in a specific state of the game, the policy suggests several actions, each with significant probability. This result indicates that although deterministic policies are effective against static opponents, they are ineffective when facing adaptive opponents that can exploit a deterministic policy [10], [29]. For example, an adaptive attacker can learn to circumvent a deterministic defense policy. Finally, apart from being stochastic, the converged policies are similar to strategies of humans. In particular, we find that our method discovered stochastic defense policies that combine threat identification with attack forecasting, and stochastic attack policies that combine reconnaissance with targeted exploits.
C. Discussion of the Results
The results demonstrate that, for a simple infrastructure, our method outperforms two baselines and discovers effective strategies for intrusion prevention without prior knowledge. When one agent is static in self-play, convergence is stable (Fig. 3) and the agents converge to deterministic policies that are effective against the static opponents. Conversely, when both agents learn simultaneously in self-play, convergence depends on the scenario (Fig. 5) and the agents converge to stochastic policies that are more general.

When inspecting the learned policies, we find that the policies reflect common-sense knowledge and are similar to strategies of humans. Most notably, these policies were discovered using no prior knowledge. This indicates that self-play works as a defense hardening technique. As the defender improves its defense in self-play, the attacker learns increasingly sophisticated attacks to circumvent the defense, leading to an artificial arms race where both agents improve their policies over time.

In this paper, we have studied a relatively small and simple infrastructure (Fig. 1), and the application of our method to more complex, and more realistic, infrastructures is left for future work. We expect that by adding more complexity to the model, our method will be able to discover increasingly sophisticated strategies. However, we also expect that training times and computational requirements will increase.

Finally, the results also show that convergence of policies in self-play with two adaptive agents remains a challenge. In our results, we observe that the stability of policy convergence differs between the three scenarios (Fig. 5), indicating that the characteristic of the environment is an important factor for convergence in self-play. We also note that all policies that did converge in self-play with two adaptive agents were stochastic, which suggests that using methods that can represent stochastic policies is important for policy convergence in self-play.
VI. RELATED WORK
Our research extends prior work on reinforcement learning applied to network security, game-learning programs, and game-theoretic modeling of security.
A. Reinforcement Learning in Network Security
The research on reinforcement learning applied to network security is in its infancy. Some early works are [25], [31], [2], [32]. For a literature review of deep reinforcement learning in network security, see [9].

The prior works that most resemble ours are [25] and [32]. Both of them experiment with abstract models for network intrusion and defense. Our work can be seen as a direct extension of these works as we use a similar model for our study. Our work in this paper differs in the following ways: (1) we propose an extension of the models in [25], [32] to include reconnaissance; (2) we analyze learned security strategies in self-play; (3) we propose a reinforcement learning method that uses function approximation, an opponent pool, and an autoregressive policy representation; and (4) we extend prior work to consider state-of-the-art algorithms (e.g. PPO [21]).
B. Game-Learning Programs
Games have been studied since the inception of artificial intelligence and machine learning. An early example is Samuel's Checkers player [33]. Other notable game-learning programs are Deep Blue [34] and TD-Gammon [26]. Recent developments include AlphaGo [12], OpenAI Five [13], DeepStack [35], and Libratus [36].

Although we take inspiration from these works, our study differs by focusing on security strategies as opposed to pure game strategies. Moreover, even though our study focuses on self-play training, we also take inspiration from prior work that has investigated learning theory in games, in particular Minimax-Q [23], Nash-Q [37], Neural Fictitious Self-Play [29], and the seminal book by Fudenberg and Levine [38].
C. Game Theoretic Modeling in Cyber Security
Several examples of game-theoretic security models can be found in the literature [5], [39], [4]. For example, FlipIt [39] is a security game that models Advanced Persistent Threats (APTs). Other notable works include the book by Alpcan and Basar [4] and the survey by Manshaei et al. [5]. We consider the related work on game-theoretic methods to be complementary to the reinforcement learning methods studied in this paper.
VII. CONCLUSION AND FUTURE WORK
We have proposed a method to automatically find security strategies for the use case of intrusion prevention. The method consists of modeling the interaction between an attacker and a defender as a Markov game and letting attack and defense strategies evolve through reinforcement learning and self-play without human intervention. We have also addressed known challenges of reinforcement learning in this domain and presented an approach that uses function approximation, an opponent pool, and an autoregressive policy representation. Using a simple infrastructure configuration, we have demonstrated that effective security strategies can emerge from self-play. Inspection of the converged policies shows that the emerged policies reflect common-sense knowledge and are similar to strategies of humans (e.g. Fig. 4). Moreover, through evaluations, we have demonstrated that our method is superior to two baseline methods (Fig. 3), but that the problem of oscillating policies in self-play remains a challenge (Fig. 5). In particular, we have found a scenario where our method does not lead to converging but oscillating policies.

In future work we plan to further study the observed policy oscillations and explore techniques for mitigation. We also plan to increase the scale of the intrusion prevention use case and model more realistic infrastructures, e.g. a campus network. Finally, we will consider two extensions to our model. First, we plan to enrich the model's action space to cover additional use cases, e.g. Advanced Persistent Threats (APTs). Second, we plan to lower the level of abstraction of the model to examine if our method can discover practically useful security strategies, i.e. strategies related to specific network protocols and software.
VIII. ACKNOWLEDGMENTS
This research is supported in part by the Swedish armed forces and was conducted at KTH Center for Cyber Defense and Information Security (CDIS). The authors would like to thank Pontus Johnson, Professor at KTH, for useful input to this research, and Rodolfo Villaça and Forough Shahab Samani for their constructive remarks.
APPENDIX
A. Hyperparameters
Parameter                               Value
Discount factor γ
Learning rate α
Batch size
Entropy coefficient
GAE λ
Clipping range ε
Optimizer                               Adam [40]
Opponent pool max size                  100000
Opponent pool increment iterations      50
Opponent pool sample probability p

TABLE 1: Hyperparameters.
REFERENCES

[1] M. A. Maloof, Machine Learning and Data Mining for Computer Security: Methods and Applications (Advanced Information and Knowledge Processing). Berlin, Heidelberg: Springer-Verlag, 2005.
[2] K. Durkota, V. Lisý, B. Bošanský, and C. Kiekintveld, "Optimal network security hardening using attack graph games," in Proceedings of the 24th International Conference on Artificial Intelligence, ser. IJCAI'15. AAAI Press, 2015, pp. 526–532. [Online]. Available: http://dl.acm.org/citation.cfm?id=2832249.2832322
[3] P. Johnson, R. Lagerström, and M. Ekstedt, "A meta language for threat modeling and attack simulations," in Proceedings of the 13th International Conference on Availability, Reliability and Security, ser. ARES 2018. New York, NY, USA: Association for Computing Machinery, 2018. [Online]. Available: https://doi.org/10.1145/3230833.3232799
[4] T. Alpcan and T. Basar, Network Security: A Decision and Game-Theoretic Approach, 1st ed. USA: Cambridge University Press, 2010.
[5] M. H. Manshaei, Q. Zhu, T. Alpcan, T. Basar, and J.-P. Hubaux, "Game theory meets network security and privacy," ACM Comput. Surv., vol. 45, no. 3, pp. 25:1–25:39, Jul. 2013. [Online]. Available: http://doi.acm.org/10.1145/2480741.2480742
[6] R. Bronfman-Nadas, N. Zincir-Heywood, and J. T. Jacobs, "An artificial arms race: Could it improve mobile malware detectors?" 2018, pp. 1–8.
[7] S. Noreen, S. Murtaza, M. Z. Shafiq, and M. Farooq, "Evolvable malware," in Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, ser. GECCO '09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 1569–1576. [Online]. Available: https://doi.org/10.1145/1569901.1570111
[8] R. Sommer and V. Paxson, "Outside the closed world: On using machine learning for network intrusion detection," in Proceedings of the 2010 IEEE Symposium on Security and Privacy, ser. SP '10. USA: IEEE Computer Society, 2010, pp. 305–316. [Online]. Available: https://doi.org/10.1109/SP.2010.25
[9] T. T. Nguyen and V. J. Reddi, "Deep reinforcement learning for cyber security," CoRR, vol. abs/1906.05799, 2019. [Online]. Available: http://arxiv.org/abs/1906.05799
[10] M. Bowling and M. Veloso, "Scalable learning in stochastic games," in AAAI Workshop on Game Theoretic and Decision Theoretic Agents, July 2002.
[11] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell, "Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4213–4220, Jul. 2019.
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[13] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. W. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang, "Dota 2 with large scale deep reinforcement learning," ArXiv, vol. abs/1912.06680, 2019.
[14] O. Vinyals, I. Babuschkin, W. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. Agapiou, M. Jaderberg, and D. Silver, "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, Nov. 2019.
[15] R. Bellman, "A Markovian decision process," Journal of Mathematics and Mechanics, vol. 6, no. 5, pp. 679–684, 1957.
[16] R. A. Howard, Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press, 1960.
[17] W. S. Lovejoy, "A survey of algorithmic methods for partially observed Markov decision processes," Annals of Operations Research, vol. 28, no. 1, pp. 47–65, Dec. 1991. [Online]. Available: https://doi.org/10.1007/BF02055574
[18] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
[19] R. Bellman, Dynamic Programming. Dover Publications, 1957.
[20] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, 1989.
[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[22] L. S. Shapley, "Stochastic games," Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
[23] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Proceedings of the Eleventh International Conference on Machine Learning, ser. ICML'94. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1994, pp. 157–163.
[24] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell, "Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4213–4220, Jul. 2019.
[25] R. Elderman, L. J. J. Pater, A. S. Thie, M. M. Drugan, and M. Wiering, "Adversarial reinforcement learning in a cyber security simulation," in ICAART, 2017.
[26] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, pp. 215–219, Mar. 1994. [Online]. Available: https://doi.org/10.1162/neco.1994.6.2.215
[27] R. J. Williams, "Reinforcement-learning connectionist systems," 1987.
[28] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[29] J. Heinrich and D. Silver, "Deep reinforcement learning from self-play in imperfect-information games," CoRR, vol. abs/1603.01121, 2016. [Online]. Available: http://arxiv.org/abs/1603.01121
[30] D. Hernandez, K. Denamganaï, Y. Gao, P. York, S. Devlin, S. Samothrakis, and J. A. Walker, "A generalized framework for self-play training," 2019, pp. 1–8.
[31] K. Chung, C. A. Kamhoua, K. A. Kwiat, Z. T. Kalbarczyk, and R. K. Iyer, "Game theory with learning for cyber security monitoring," in Proceedings of the 2016 IEEE 17th International Symposium on High Assurance Systems Engineering (HASE), ser. HASE '16. Washington, DC, USA: IEEE Computer Society, 2016, pp. 1–8. [Online]. Available: http://dx.doi.org/10.1109/HASE.2016.48
[32] A. Ridley, "Machine learning for autonomous cyber defense," The Next Wave, vol. 22, no. 1, 2018.
[33] A. L. Samuel, "Some studies in machine learning using the game of checkers," IBM J. Res. Dev., vol. 3, no. 3, pp. 210–229, Jul. 1959. [Online]. Available: https://doi.org/10.1147/rd.33.0210
[34] M. Campbell, A. J. Hoane, and F.-h. Hsu, "Deep Blue," Artif. Intell., vol. 134, no. 1–2, pp. 57–83, Jan. 2002. [Online]. Available: https://doi.org/10.1016/S0004-3702(01)00129-1
[35] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling, "DeepStack: Expert-level artificial intelligence in heads-up no-limit poker," Science, vol. 356, no. 6337, pp. 508–513, 2017. [Online]. Available: https://science.sciencemag.org/content/356/6337/508
[36] N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals," Science, vol. 359, no. 6374, pp. 418–424, 2018. [Online]. Available: https://science.sciencemag.org/content/359/6374/418
[37] J. Hu and M. P. Wellman, "Nash Q-learning for general-sum stochastic games," J. Mach. Learn. Res., vol. 4, pp. 1039–1069, Dec. 2003.
[38] D. Fudenberg and D. Levine, The Theory of Learning in Games, 1st ed. The MIT Press, 1998, vol. 1. [Online]. Available: https://EconPapers.repec.org/RePEc:mtp:titles:0262061945
[39] M. van Dijk, A. Juels, A. Oprea, and R. L. Rivest, "FlipIt: The game of 'stealthy takeover'," Journal of Cryptology, vol. 26, no. 4, pp. 655–713, Oct. 2013. [Online]. Available: https://doi.org/10.1007/s00145-012-9134-5
[40] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the 3rd International Conference for Learning Representations (ICLR), San Diego, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980