Learning to Play against Any Mixture of Opponents

Max Olan Smith, University of Michigan ([email protected])
Thomas Anthony, DeepMind ([email protected])
Yongzhao Wang, University of Michigan ([email protected])
Michael P. Wellman, University of Michigan ([email protected])
Abstract
Intuitively, experience playing against one mixture of opponents in a given domain should be relevant for a different mixture in the same domain. We propose a transfer learning method, Q-Mixing, that starts by learning Q-values against each pure-strategy opponent. A Q-value for any distribution of opponent strategies is then approximated by appropriately averaging the separately learned Q-values. From these components, we construct policies against all opponent mixtures without any further training. We empirically validate Q-Mixing in two environments: a simple grid-world soccer environment, and a complicated cyber-security game. We find that Q-Mixing successfully transfers knowledge across any mixture of opponents. We next consider the use of observations during play to update the believed distribution of opponents. We introduce an opponent classifier, trained in parallel to Q-learning using the same data, and use the classifier's output to refine the mixing of Q-values. We find that Q-Mixing augmented with the opponent classifier performs comparably to, and with lower variance than, training directly against a mixed-strategy opponent.
Introduction

Reinforcement learning (RL) agents commonly interact in environments with other agents, whose behavior may be uncertain. For any particular probabilistic belief over the behavior of another agent (henceforth, opponent), we can learn to play with respect to that opponent distribution (henceforth, mixture), for example by training in simulation against opponents sampled from the mixture. If the mixture changes, ideally we would not have to train from scratch, but rather could transfer what we have learned to construct a policy to play against the new mixture.

Traditional RL algorithms include no mechanism to explicitly prepare for variability in opponent mixtures. Instead, current solutions either learn a new behavior or update a previously learned one. Alternatively, researchers have designed methods for learning a single behavior that succeeds across a set of strategies [52], or for quickly adapting in response to new strategies [21]. In this work, we explicitly tackle the problem of responding to new opponent mixtures without requiring further simulations for learning. We propose a new algorithm,
Q-Mixing, that effectively transfers learning across opponent mixtures. The algorithm is designed within a population-based learning regime [21, 28], where training is conducted against a known distribution of opponent policies. Q-Mixing initially learns value-based best responses (BR), represented as Q-functions, with respect to each of the opponent's pure strategies. It then transfers this knowledge against any given opponent mixture by weighting the Q-functions according to the probability that the opponent plays each of its pure strategies. The idea is illustrated in Figure 1. This calculation is an approximation of the Q-values against the mixed-strategy opponent, with error due to misrepresenting the future belief of the opponent by the current belief. The result is an approximate BR policy against any mixture, constructed without additional training.

Figure 1: Q-Mixing concept. BRs to each of the opponent's pure strategies $\pi_{-i}$ are learned separately. The Q-value for a given opponent mixed strategy $\sigma_{-i}$ is then derived by combining these components.

We experimentally validate our algorithm on: (1) a simple grid-world soccer game, and (2) a complicated cyber-security game. Our experiments show that Q-Mixing is effective in transferring learned responses across opponent mixtures.

We also address two potential issues with combining Q-functions according to a given mixture. The first is that the agent receives information during play that provides evidence about the opponent's strategy. We address this by introducing an opponent classifier to predict which opponent we are playing for a particular state, and reweighting the Q-values to focus on the likely opponents. The second issue is that the complexity of the policy produced by Q-Mixing grows linearly with the support of the opponent's mixture. This can make simulation and decision making computationally expensive. We propose using policy distillation [40] as a method for compressing a complex Q-Mixing policy. Our experiments show that the compressed policy is able to recover the full performance of the Q-Mixing policy.

Key contributions:
1. We theoretically relate (in an idealized setting) the Q-value for an opponent mixture to the Q-values for the mixture's components.
2. We present a new transfer learning algorithm, Q-Mixing, that uses this relationship to construct approximate policies against any given opponent mixture, without additional training.
3. We augment this algorithm with runtime opponent classification, to account for observations that can inform predictions of the opponent's policy during play.
Preliminaries

At any particular time $t \in T$, an agent receives the environment's state $s_t \in \mathcal{S}$, or an observation $o_t \in \mathcal{O}$, a partial state. (Even if an environment's state is fully observable, the inclusion of other agents with uncertain behavior makes the overall system partially observable.) The agent then takes an action $a_t \in \mathcal{A}$ based on that observation, and receives a reward $r_t \in \mathbb{R}$. The agent's policy describes its behavior given each observation, $\pi : \mathcal{O} \to \Delta(\mathcal{A})$. Actions are received by the environment, and a next observation is determined following the environment's transition dynamics $p : \mathcal{O} \times \mathcal{A} \to \mathcal{O}$.

The agent's goal is to maximize its reward over time. This quantity is called the return: $G_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$, where $\gamma$ is the discount factor weighting the importance of immediate rewards. Return is used to define the value of being in a given observation state:
$$V(o_t) = \mathbb{E}_{\pi}\left[\sum_{l=0}^{\infty} \gamma^l\, r(o_{t+l}, a_{t+l})\right],$$
and of taking an action given an observation:
$$Q(o_t, a_t) = r(o_t, a_t) + \gamma\, \mathbb{E}_{o_{t+1} \in \mathcal{O}}\left[V(o_{t+1})\right].$$

For multiagent settings, we index the agents and distinguish the agent-specific components with subscripts. Agent $i$'s policy is $\pi_i : \mathcal{O}_i \to \Delta(\mathcal{A}_i)$, and the policy of the opponent is given the negated index, $\pi_{-i} : \mathcal{O}_{-i} \to \Delta(\mathcal{A}_{-i})$. Boldface denotes elements that are joint across agents (e.g., joint action $\mathbf{a}$). Our methods are defined here for environments with two agents; extension to greater numbers while maintaining computational tractability is a topic for future work.
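As a concrete illustration of these definitions, the following minimal Python sketch computes a discounted return and a one-step Q backup from sampled quantities. It is our own illustration, not code from the paper; all names are ours.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_t = sum_l gamma^l * r_{t+l} for a finite reward sequence."""
    return sum(gamma**l * r for l, r in enumerate(rewards))

def q_backup(reward, next_value, gamma=0.99):
    """One-step backup: Q(o_t, a_t) = r(o_t, a_t) + gamma * V(o_{t+1})."""
    return reward + gamma * next_value

# Example: a three-step episode with rewards [1, 0, 1].
print(discounted_return([1.0, 0.0, 1.0]))  # 1 + 0.99^2 = 1.9801
```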
Agent $i$ has a strategy set $\Pi_i$ comprising the possible policies it can employ. Agent $i$ may choose a single policy from $\Pi_i$ to play as a pure strategy, or randomize with a mixed strategy $\sigma_i \in \Delta(\Pi_i)$. Note that the pure strategies $\pi_i$ may themselves choose actions stochastically. For a mixed strategy, we denote the probability that the agent plays a particular policy $\pi_i$ by $\sigma_i(\pi_i)$. A best response (BR) to an opponent's strategy $\sigma_{-i}$ is a policy with maximum return against $\sigma_{-i}$.

Agent $i$'s prior belief about its opponent is represented by an opponent mixed strategy, $\sigma^0_{-i} \equiv \sigma_{-i}$. The opponent plays the mixture by sampling a policy according to $\sigma_{-i}$ at the start of the episode. The agent's updated belief at time $t$ about the opponent policy faced is denoted $\sigma^t_{-i}$.

We introduce the term Strategic Response Value (SRV) to refer to the observation value against a particular opponent strategy.

Definition 1 (Strategic Response Value). An agent's $\pi_i$ strategic response value is its expected return given an observation, when playing $\pi_i$ against a specified opponent strategy:
$$V^{\pi_i}(o^t_i \mid \sigma^t_{-i}) = \mathbb{E}_{\sigma^t_{-i}}\left[\sum_{\mathbf{a}} \boldsymbol{\pi}(a_i \mid o_i) \sum_{o'_i, r_i} p(o'_i, r_i \mid o_i, \mathbf{a}) \cdot \delta\right], \quad \text{where } \delta \equiv r_i + \gamma V^{\pi_i}(o'_i \mid \sigma^{t+1}_{-i}).$$
Let the optimal SRV be $V^*_i(o^t_i \mid \sigma^t_{-i}) = \max_{\pi_i} V^{\pi_i}(o^t_i \mid \sigma^t_{-i})$.

From the SRV, we define the Strategic Response Q-Value (SRQV) for a particular opponent strategy.
Definition 2 (Strategic Response Q-Value). An agent's $\pi_i$ strategic response Q-value is its expected return for an action given an observation, when playing $\pi_i$ against a specified opponent strategy:
$$Q^{\pi_i}(o^t_i, a^t_i \mid \sigma^t_{-i}) = \mathbb{E}_{\sigma^t_{-i}}\left[r^t_i\right] + \gamma\, \mathbb{E}_{o^{t+1}_i}\left[V^{\pi_i}(o^{t+1}_i \mid \sigma^{t+1}_{-i})\right], \quad \text{where } r^t_i \equiv r_i(o^t_i, a^t_i, a^t_{-i}).$$
Let the optimal SRQV be $Q^*_i(o^t_i, a^t_i \mid \sigma^t_{-i}) = \max_{\pi_i} Q^{\pi_i}(o^t_i, a^t_i \mid \sigma^t_{-i})$.

Our goal is to transfer the Q-learning effort across different opponent mixtures. We consider the scenario where we first learn against each of the opponent's pure strategies. From this, we construct a Q-function for a given distribution of opponents from the policies trained against each opponent pure strategy.
Q-Mixing

Let us first consider a simplified setting with a single state. This is essentially a problem of bandit learning, where our opponent's strategy sets the rewards of each arm for an episode. Intuitively, our expected reward against a mixture of opponents is proportional to the payoff against each opponent weighted by their respective likelihood. As shown in Theorem 1, weighting the component SRQVs by the opponent's distribution supports a BR to that mixture. We call this relationship
Q-Mixing: Prior and define it as follows:
$$Q^*_i(a_i \mid \sigma_{-i}) = \sum_{\pi_{-i} \in \Pi_{-i}} \sigma_{-i}(\pi_{-i})\, Q^*_i(a_i \mid \pi_{-i}).$$

Theorem 1 (Single-State Q-Mixing). Let $Q^*_i(\cdot \mid \pi_{-i})$, $\pi_{-i} \in \Pi_{-i}$, denote the optimal strategic response Q-value against opponent policy $\pi_{-i}$. Then for any opponent mixture $\sigma_{-i} \in \Delta(\Pi_{-i})$, the optimal strategic response Q-value is given by
$$Q^*_i(a_i \mid \sigma_{-i}) = \sum_{\pi_{-i} \in \Pi_{-i}} \sigma_{-i}(\pi_{-i})\, Q^*_i(a_i \mid \pi_{-i}).$$

Proof. The definition of the Q-value is as follows [46]:
$$Q^*_i(a_i) = \sum_{r_i} p(r_i \mid a_i) \cdot r_i.$$
In a multiagent system, the dynamics model $p$ suppresses the complexity introduced by the other agents. We can unpack the dynamics model to account for the other agents as follows:
$$p(r_i \mid a_i) = \sum_{\pi_{-i}} \sum_{s_{-i}} \sum_{a_{-i}} \pi_{-i}(a_{-i}) \cdot p(r_i \mid \mathbf{a}).$$
We can then unpack the strategic response value as follows:
$$Q^*_i(a_i \mid \pi_{-i}) = \sum_{a_{-i}} \pi_{-i}(a_{-i}) \sum_{r_i} p(r_i \mid \mathbf{a}) \cdot r_i.$$
Now we can rearrange the expanded Q-value to explicitly account for the opponent's strategy. The independence assumption enables the following rewriting, by letting us treat the opponent's mixed strategy as a constant condition:
$$\begin{aligned}
Q^*_i(a_i \mid \sigma_{-i}) &= \sum_{r_i} \sum_{\pi_{-i}} \sigma_{-i}(\pi_{-i}) \sum_{a_{-i}} \pi_{-i}(a_{-i})\, p(r_i \mid \mathbf{a}) \cdot r_i \\
&= \sum_{\pi_{-i}} \sigma_{-i}(\pi_{-i}) \sum_{a_{-i}} \pi_{-i}(a_{-i}) \sum_{r_i} p(r_i \mid \mathbf{a}) \cdot r_i \\
&= \sum_{\pi_{-i}} \sigma_{-i}(\pi_{-i})\, Q^*_i(a_i \mid \pi_{-i}). \qquad \square
\end{aligned}$$
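To make the mixing rule concrete, here is a minimal single-state sketch in Python. The arrays and their values are illustrative stand-ins, not the paper's learned Q-values; only the weighting rule comes from the definition above.

```python
import numpy as np

# Assumed setup: 3 opponent pure strategies, 4 actions.
# q_pure[k, a] approximates Q*_i(a | pi_-i^k), learned separately per opponent.
q_pure = np.array([[1.0, 0.2, 0.0, 0.5],
                   [0.1, 0.9, 0.3, 0.0],
                   [0.0, 0.4, 0.8, 0.2]])
sigma = np.array([0.5, 0.3, 0.2])  # opponent mixture over the pure strategies

# Q-Mixing: Prior -- Q*_i(a | sigma_-i) = sum_k sigma(k) * Q*_i(a | pi_-i^k).
q_mixed = sigma @ q_pure
best_action = int(np.argmax(q_mixed))  # greedy BR action against the mixture
print(q_mixed, best_action)            # [0.53 0.45 0.25 0.29] 0
```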
Next we consider the RL setting where both agents are able to influence an evolving observation distribution. As a result of the joint effect of the agents' actions on observations, the agents have an opportunity to gather information about their opponent during an episode. Methods in this setting need to (1) use information from the past to update the agent's belief about the opponent, and (2) grapple with uncertainty about the future. To deal with the past, the agent maintains a belief $\sigma^t_{-i}$ about the opponent's policy, $\psi_i : \mathcal{O}^t_i \to \Delta(\Pi_{-i})$. To bring Q-Mixing into this setting we need to quantify the agent's current belief about their opponent and their future uncertainty. Let $\psi$ represent the agent's current belief about the opponent's policy. This quantity replaces the prior $\sigma_{-i}$ that appears in the Q-Mixing: Prior definition.

During a run with an opponent pure strategy drawn from a distribution, the actual observations experienced generally depend on the identity of this pure strategy. With this idea in mind, we propose an approximate version of Q-Mixing that accounts for past information. The approximation works by first predicting the relative likelihood of each opponent given the current observation. From this prediction, we weight the Q-value-based BRs against each opponent by their relative likelihood. This approximation does not consider any future uncertainty about the opponent (blue area in Figure 2).

Let the previously defined $\psi$ be the opponent policy classifier (OPC), which predicts the opponent's policy. In this work we consider a simplified version of this function that operates only on the agent's current observation, $\psi_i : \mathcal{O}_i \to \Delta(\Pi_{-i})$. We then augment Q-Mixing to weight the importance of each BR as follows:
$$Q^{\pi_i}(o_i, a_i \mid \sigma_{-i}) = \sum_{\pi_{-i}} \psi_i(\pi_{-i} \mid o_i, \sigma_{-i})\, Q^{\pi_i}(o_i, a_i \mid \pi_{-i}).$$
We refer to this quantity as Q-Mixing, or Q-Mixing: X, where X describes $\psi$. By continually updating the opponent distribution during play, the adjusted Q-Mixing result better responds to the actual opponent.

An ancillary benefit of the opponent classifier is that poorly estimated Q-values tend to have their impact minimized. For example, if an observation occurs only against the second pure strategy, then the Q-value against the first pure strategy would not be trained well, and thus could distort the policy from Q-Mixing. These poorly trained cases correspond to unlikely opponents, and get reduced weighting in the version of Q-Mixing augmented by the classifier.

Figure 2: Opponent uncertainty over time. The yellow area represents uncertainty reduction as a result of updating the belief about the distribution of the opponent. The blue area represents approximation error incurred by Q-Mixing.
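The runtime reweighting can be sketched as follows. Here `q_functions` and `opc` are hypothetical stand-ins for the trained per-opponent Q-networks and the classifier; only the weighting rule itself comes from the definition above.

```python
import numpy as np

def q_mixing_policy(obs, q_functions, opc):
    """Greedy Q-Mixing action for one observation.

    q_functions: list of callables; q_functions[k](obs) -> per-action Q-values
                 trained against opponent pure strategy k.
    opc: callable; opc(obs) -> predicted distribution over opponent pure strategies.
    """
    weights = opc(obs)                                 # psi_i(pi_-i | o_i)
    q_stack = np.stack([q(obs) for q in q_functions])  # shape: (n_opponents, n_actions)
    q_mixed = weights @ q_stack                        # classifier-reweighted Q-values
    return int(np.argmax(q_mixed))

# Toy stand-ins for the trained models:
q_functions = [lambda o: np.array([1.0, 0.0]), lambda o: np.array([0.0, 1.0])]
opc = lambda o: np.array([0.9, 0.1])  # classifier is confident in opponent 0
print(q_mixing_policy(None, q_functions, opc))  # -> 0
```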
To account for future uncertainty we must replace the Q-values against individual opponents. This can be done by expanding the Q-value definition into two parts: the expected reward under the current belief about the opponent's policy, and the expected value of the next observation. The second term, the successor observation value, recursively references a new opponent belief, accounting for changing uncertainty in the future. The extended formulation, Value Iteration with Q-Mixing (VIQM), is given by:
$$\delta = r_i(o^t_i, a^t_i \mid \pi_{-i}) + \gamma\, \mathbb{E}_{o^{t+1}_i}\left[V^*(o^{t+1}_i \mid \sigma_{-i})\right], \qquad Q^*_i(o^t_i, a^t_i \mid \sigma_{-i}) = \sum_{\pi_{-i} \in \Pi_{-i}} \psi_i(\pi_{-i} \mid o^t_i, \sigma_{-i}) \cdot \delta.$$
If we assume that we have access to both a dynamics model and the opponent-dependent observation distribution, then we can directly solve for this quantity through Value Iteration (Algorithm 1). These requirements are quite strong, essentially demanding perfect knowledge of the system with regard to all opponents. The additional step of Value Iteration also carries a computational burden, as it requires iterating over the full observation and action spaces. Though these costs may render Value Iteration infeasible in practice, we provide the algorithm as a way to ensure correctness of the Q-values.
Algorithm 1: Value Iteration: Q-Mixing

Input: $\mathcal{S}$, $\mathcal{A}$, $T$, $R$, $\epsilon$, $\gamma$
$V^0(s \mid \sigma_{-i}) \leftarrow \sum_{\pi_{-i}} \sigma_{-i}(\pi_{-i})\, Q(s, a \mid \pi_{-i})$
do
  $Q^t(s, a \mid \sigma_{-i}) \leftarrow \sum_{\pi_{-i}} \psi(\pi_{-i} \mid s, \sigma_{-i}) \sum_{s', r} T(s', r \mid s, a, \pi_{-i})\left[r + \gamma V^{t-1}(s' \mid \sigma_{-i})\right]$
  $V^t(s \mid \sigma_{-i}) \leftarrow \max_a Q^t(s, a \mid \sigma_{-i})$
  $\pi^t(s \mid \sigma_{-i}) \leftarrow \arg\max_a Q^t(s, a \mid \sigma_{-i})$
while $\exists s \in \mathcal{S} : |V^t(s) - V^{t-1}(s)| > \epsilon$
Output: $V^t$, $Q^t$, $\pi^t$
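A tabular Python sketch of Algorithm 1 follows, under the strong assumptions the text names: known per-opponent dynamics `T` and rewards `R`, plus a given state-conditioned belief `psi`. All names and array shapes are our illustrative choices.

```python
import numpy as np

def viqm(T, R, psi, sigma, gamma=0.99, tol=1e-6):
    """Tabular Value Iteration with Q-Mixing (sketch of Algorithm 1).

    T[k, s, a, s2]: transition probabilities under opponent pure strategy k.
    R[k, s, a]:     expected reward under opponent pure strategy k.
    psi[s, k]:      belief over opponent pure strategies given state s (rows sum to 1).
    sigma[k]:       prior opponent mixture, used here only to initialize V.
    """
    # Initialize V from the prior-weighted immediate rewards.
    V = np.einsum('k,ksa->sa', sigma, R).max(axis=1)
    while True:
        # Q(s,a|sigma) = sum_k psi(k|s) [R_k(s,a) + gamma sum_s2 T_k(s,a,s2) V(s2)]
        backup = R + gamma * np.einsum('ksab,b->ksa', T, V)
        Q = np.einsum('sk,ksa->sa', psi, backup)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= tol:
            return V_new, Q, Q.argmax(axis=1)  # values, Q-values, greedy policy
        V = V_new
```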
Evaluation: Grid-World Soccer

We first evaluate Q-Mixing on a simple grid-world soccer environment [29, 15]. This environment has small state and action spaces, allowing for inexpensive simulation. With this environment we pose the following questions:

1. Can VIQM obtain Q-values for mixed-strategy opponents?
2. Can Q-Mixing transfer Q-values across all of the opponent's mixed strategies?

The soccer environment is composed of a soccer field, two players, one ball, and two goals. Each player's objective is to acquire the ball and score a goal while preventing the opponent from scoring on their own goal. The scorer receives +1 reward, and the opponent receives -1 reward. The state consists of the entire field, including the locations of the players and the ball. The players may move in any of the four cardinal directions or not move. Actions taken by the players are executed in a random order, and if the player possessing the ball moves last, then the first player may take possession of the ball by colliding with them. A graphical example of the soccer environment can be seen in Figure 3.

Figure 3: Grid-world soccer environment. The letters represent the respective players. The ball may spawn in either middle highlighted tile, and each player's goal is to score in the opposite net.

In our experiments, we learn policies for Player 1 using Double DQN [50]. The policies are deep neural networks with two hidden layers of size 50 and ReLU activation functions. Player 2 plays a strategy over five policies, using the same shape of neural network as Player 1, generated using the double oracle algorithm [32, 28]. These policies are frozen for the duration of the experiments. Further details of the environment and experiments are in Section A.1.
We now turn to our first question: whether VIQM can obtain Q-values for mixed-strategy opponents. To answer this, we run the VIQM algorithm against a fixed opponent mixed strategy (Algorithm 1). We construct dynamics models for each opponent by considering the opponent's policy and the multiagent dynamics model as a single entity. We then approximate the relative observation-occupancy distribution by rolling out 30 episodes against each policy and estimating the distribution.

In our experiment, an optimal policy was reached in fourteen iterations. The resulting policy best-responded to each individual opponent and to the mixture. This empirically validates our first hypothesis.
Our second question is whether Q-Mixing can produce high-quality responses for any opponent mixture. Our evaluation of this question employs the same five opponent pure strategies as the previous experiment. We first trained a baseline, BR(Uniform), directly against the uniform mixed-strategy opponent. The baseline was trained using 300000 simulation steps. The same hyperparameters were used to train against each of the opponent's pure strategies, with the simulation budget split equally. The per-opponent Q-values are used as the components for Q-Mixing, and an OPC is also trained from the corresponding replay buffers. This is repeated for five random seeds. We simulate the performance of each method against a representative selection of the opponent's entire mixed-strategy space: all opponent mixtures with probabilities truncated to the tenths place, resulting in 860 strategy distributions.

Figure 4 shows Q-Mixing's performance across the opponent mixed-strategy space. Learning in this domain is fairly easy, so both methods are expected to win against almost every mixed-strategy opponent. Nevertheless, Q-Mixing generalizes across strategies better, albeit with slightly higher variance. While the improvement of Q-Mixing is incremental, we interpret this first evaluation as validating the promise of Q-Mixing for coverage across mixtures.

Figure 4: Q-Mixing's coverage of the opponent's strategy space. Each strategy is a mixture over the 5 opponents. The strategies are sorted per-method by the method's performance. Shaded regions represent the standard error over 5 seeds. BR(Mixture) is trained against the uniform mixture opponent. Both methods use the same number of experiences.
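As a sanity check on this kind of evaluation set, a tenths-truncated grid over the 5-simplex can be enumerated as below. This is our own sketch; the paper's exact selection of 860 distributions may apply additional filtering beyond truncation to tenths.

```python
from itertools import product

# Mixtures over 5 opponent policies with weights in {0.0, 0.1, ..., 1.0}
# that sum to 1 (compositions of 10 into 5 parts).
grid = [w for w in product(range(11), repeat=5) if sum(w) == 10]
mixtures = [[x / 10 for x in w] for w in grid]
print(len(mixtures))  # 1001 grid points on the simplex
```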
Evaluation: Cyber-Security Game

Not all environments are simple enough for an agent to acquire optimal Q-values, or to observe the full environment state. In this section, we look at the performance of Q-Mixing under partial observation. We study the following research questions on a cyber-security game called Attack Graphs:

1. Can Q-Mixing work with non-optimal Q-values over observations from a complex environment?
2. Can using an opponent policy classifier to update the opponent distribution improve performance?
3. Can policy distillation be used to compress a Q-Mixing model while preserving performance?

Attack Graphs model the interaction between a cyber-attacker and a cyber-defender [37]. The nodes of the graph represent security conditions (e.g., a vulnerability on a server), and the edges correspond to exploits that activate a security condition with some probability of success. The attacker's aim is to reach a set of goal nodes, which provides a large reward for the attacker and a loss for the defender. The defender receives a noisy observation of the graph, influenced by false-alarm and false-negative signals, while the attacker observes the true state of the graph. The players take actions trying to infiltrate or protect the nodes, and each action they take incurs a fixed cost.

A variety of cyber-security games on attack graphs have been studied [33, 9, 10, 34, 53]. In our experiments, we follow the game formulation of Wright et al. [53], which motivated the development of a Deep-RL-based BR oracle for this domain. The graph used is an Erdős–Rényi graph with 30 nodes and 100 edges. A visualization of the environment and more details are included in Section A.2. The policies in this environment are deep neural networks with two hidden layers of size 256 with tanh activation functions. They are trained using the Double DQN algorithm [50], and frozen for the duration of the experiments. We report results for the defender.
We experimentally evaluate Q-Mixing on the Attack Graph environment, allowing us to confirm our algorithm's robustness to complex environments. First, a set of 37 opponent policies is generated through the double oracle algorithm [32]. Three pure strategies are randomly chosen from this set, and a BR is trained against each. Then a mixture $\sigma_{-i}$ over the three policies is chosen (arbitrarily). A baseline BR($\sigma_{-i}$) is trained directly against the mixture. We evaluate both Q-Mixing and BR($\sigma_{-i}$) against each of the three pure-strategy opponents as well as the mixture.

A summary of the performance against each opponent strategy is presented in Table 1. We observe that Q-Mixing: Prior performs worse than BR($\sigma_{-i}$). Neither Q-Mixing: Prior nor BR($\sigma_{-i}$) performs well against the pure-strategy opponents, when compared to their respective BRs. To further understand the relationship between these two methods, we plot BR($\sigma_{-i}$)'s training curve in Figure 5. From this graph we can see that Q-Mixing: Prior achieves a performance equivalent to BR($\sigma_{-i}$) after BR($\sigma_{-i}$) has finished exploring and has begun to converge. If we are in a training regime where we may already have access to BRs, then this enables free policy construction.

Table 1: Performance of each policy against each opponent strategy ($\pi^1_{-i}$, $\pi^2_{-i}$, $\pi^3_{-i}$, and the mixture $\sigma_{-i}$), reported as mean ± standard error. Rows: BR($\pi^1_{-i}$), BR($\pi^2_{-i}$), BR($\pi^3_{-i}$), BR($\sigma_{-i}$), Q-Mixing: Prior, Q-Mixing: Obs. Freq., and Q-Mixing: Opp. Classifier.

Figure 5: Learning curve of the BR policy trained against the mixed-strategy opponent, compared to the performance of Q-Mixing: Prior.

This brings us to our next research question: how can we improve performance against individual opponent policies? We hypothesize that if a player is able to correctly identify which opponent pure strategy is being played, it can weight the importance of the Q-values learned against that pure strategy more highly. To verify this hypothesis, we train an OPC using the replay buffers $B(\pi_{-i})$ associated with each BR policy. These are the same buffers that were used to train the BRs, and cost no additional compute to collect. This data is used to train an OPC that outputs a distribution over opponent pure strategies for each observation. We first look at observation frequency as a baseline OPC, and then consider a learned OPC.

The observation-frequency OPC uses relative observation occupancy as a measure of how likely the opponent is to be playing a particular pure strategy. We weight the importance of each pure strategy by the frequency with which the observation occurs in its corresponding replay buffer, compared to the total frequency across all replay buffers. If an observation does not occur in any replay buffer, we weight each Q-value equally. The observation frequency is computed as:
$$\psi_{\text{Observation Frequency}}(\pi^k_{-i}, o_i) = \frac{\operatorname{count}(o_i, B(\pi^k_{-i}))}{\sum_{\pi^j_{-i} \in \Pi_{-i}} \operatorname{count}(o_i, B(\pi^j_{-i}))},$$
where $\operatorname{count}(o_i, B)$ returns the number of times $o_i$ appears in the replay buffer $B$.
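A minimal sketch of the observation-frequency classifier, assuming hashable observations and one replay buffer per opponent (names are ours):

```python
from collections import Counter

def make_obs_frequency_opc(replay_buffers):
    """replay_buffers: list over opponents; each an iterable of observations."""
    counts = [Counter(buf) for buf in replay_buffers]

    def psi(obs):
        freqs = [c[obs] for c in counts]
        total = sum(freqs)
        if total == 0:  # unseen observation: fall back to uniform weighting
            return [1.0 / len(counts)] * len(counts)
        return [f / total for f in freqs]

    return psi

opc = make_obs_frequency_opc([["o1", "o1", "o2"], ["o2", "o3"]])
print(opc("o2"))  # [0.5, 0.5] -- "o2" appears once in each buffer
```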
The learned OPC utilizes a deep neural network to classify each observation to the corresponding opponent pure strategy. The neural network has three hidden layers of sizes 200, 128, and 128, respectively, with tanh activations. To train this classifier we take each experience and assign it a class label based on the opponent. The classifier is trained using cross-entropy loss.

As reported in Table 1, the observation-frequency OPC improves over Q-Mixing: Prior within noise, but dramatically reduces the variance. The opponent-classifier version further improves over observation frequency, bringing the performance on par with BR($\sigma_{-i}$) with much lower variance. In particular, the OPC performs comparably against the first two opponent pure strategies with their respective best-responders. The performance against the third pure strategy is poor; however, this is the least likely opponent, and this may have enabled the stronger performance on the other opponents. This indicates that the opponent classifier is able to correctly identify which pure strategy the opponent is playing, and weight it highly enough to avoid introducing noise from the other component Q-values. This evidence supports our hypothesis that using an opponent classifier to update the opponent distribution can provide performance improvements against mixed-strategy opponents.

Policy Distillation

In Q-Mixing we need to compute Q-values for each of the opponent's pure strategies. This can be a limiting factor for parametric policies, like deep neural networks, where the policy's complexity grows linearly in the size of the support of the opponent's mixture. This can become unwieldy in both memory and computation. To remedy these issues, we propose using policy distillation to compress a Q-Mixing policy into a smaller parameter space [19, 40].

In the policy distillation framework, a larger neural network, referred to as the "teacher", is used as a training target for a smaller neural network called the "student". In our experiment the Q-Mixing policy is the teacher to a student neural network that is the size of a single best-response policy. The student is trained in a supervised learning framework, where the dataset is the concatenated replay buffers from training the pure-strategy best responses. This is the same dataset that was used for opponent classification, and it was notably generated without running any additional simulations. A batch of data is sampled from the replay buffer, and the student and teacher predict Q-values $Q_S$ and $Q_T$, respectively, for each action. The student is then trained to minimize the KL-divergence between its predicted Q-values and the teacher's. There is a small wrinkle: the policies produce Q-values, while KL-divergence is a measure over probability distributions. To make this loss function compatible, the Q-values are transformed into a probability distribution by a softmax with temperature $\tau$. The temperature parameter allows us to control the softness of the maximum operator. A high temperature produces actions that have near-uniform probability, and as the temperature is lowered the distribution concentrates weight on the highest Q-values [46]. The benefit of a higher temperature is that more information about each state can be passed from the teacher to the student. The full policy distillation loss is:
$$\mathcal{L}_{\text{Distill}} = \frac{1}{|D|} \sum_{i} \operatorname{softmax}\!\left(\frac{Q_T}{\tau}\right) \ln\!\left(\frac{\operatorname{softmax}(Q_T / \tau)}{\operatorname{softmax}(Q_S / \tau)}\right),$$
where $D$ is the dataset of concatenated replay buffers. Additional training details are described in Section A.4.
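A sketch of this loss in Python/NumPy (our illustration; the original trains the student network by gradient descent in a deep learning framework, and the temperature value here is an example):

```python
import numpy as np

def softmax(q, tau):
    z = q / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(q_teacher, q_student, tau=0.01):
    """KL(softmax(Q_T / tau) || softmax(Q_S / tau)), averaged over a batch.

    q_teacher, q_student: arrays of shape (batch, n_actions).
    """
    p_t = softmax(q_teacher, tau)
    p_s = softmax(q_student, tau)
    return np.mean(np.sum(p_t * np.log(p_t / p_s), axis=-1))

q_t = np.array([[1.0, 0.5], [0.2, 0.9]])
print(distill_loss(q_t, q_t))  # 0.0 -- student exactly matches teacher
```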
The learning curve of the student is reported in Figure 6. We found that the student was able to recover the performance of Q-Mixing: OPC, albeit with higher variance, within five minutes. This study did not look at tuning the student, and further improvements with the same methodology may be possible. This result confirms our hypothesis that policy distillation is able to effectively compress a policy derived by Q-Mixing.

Figure 6: Simulation performance of Q-Mixing: OPC (teacher) being distilled into a smaller network (student). Results are averaged over 5 random seeds.

Related Work

The most relevant line of work in multiagent learning studies the interplay between centralized learning and decentralized execution [47, 26]. In this regime, agents are able to train with additional information about the other agents that would not be available during evaluation (e.g., the other agents' states [38], actions [7], or a coordination signal [15]). The key question then becomes: how can we create a policy that can be evaluated without the additional information? A popular approach is to decompose the joint-action value into independent Q-values for each agent [16, 18, 45, 38, 31]. Another direction looks at learning a centralized critic, which can train independent agent policies [17, 30, 13]. Some work has proposed constructing meta-data about the agents' current policies as a way to reduce the learning instability present in environments where other agents' policies are changing [11, 35].

Another set of assumptions that can be made is that all players have fixed strategy sets. Under these assumptions, agents can maintain more sophisticated beliefs about their opponent [55], and extend this to recursive-reasoning procedures [54]. Extending these works to fit the Q-Mixing problem statement is tangential to this study; however, it may be a promising direction for improving the quality of the OPC. One more potential extension of the OPC is to consider alternative goals. Instead of focusing exclusively on predicting the opponent, in safety-critical situations an agent will want to consider an objective that accounts for inaccurate prediction of its opponent. The Restricted Nash Response of Johanson et al. [22] encapsulates this by balancing maximal performance when the prediction is correct against reasonable performance when it is not.

Instead of building or using complete models of opponents, one may use implicit representations of opponents. By forgoing an explicit model of the opponent, these methods avoid needing a large amount of data to reconstruct the opponent's policy, and are less likely to contain model errors that must be overcome. He et al. [18] propose DRON, which uses a learned latent prediction of the opponent's actions as conditioning information for the policy (similar in nature to the opponent actions used in the joint-action-value work). They also show another version of DRON which uses a Mixture-of-Experts [20] operation to marginalize over the possible opponent behaviors. Bard et al. [3] propose implicitly modelling opponents through the payoffs received from playing a portfolio of the agent's policies.

Most multiagent learning work focuses on simultaneous learning by many agents, where there is not a distribution of opponents. This difference in setting can strongly influence the final learned policies. For example, an agent trained concurrently with another agent may be unable to coordinate with a different agent [4]. Another potential problem is that each agent now faces a dynamic learning problem, where it must learn a moving target (the other agent's policy) [12, 49].
Multiagent learning is analogous to multi-task learning. In this analogy, each opponent strategy/policy corresponds to solving a different task, and the opponent's mixed strategy corresponds to the distribution over tasks. Similar analogies can be made with objectives, goals, contexts, etc. [24, 39]. The multi-task community has roughly separated learnable knowledge into two categories [44].
Task-relevant knowledge pertains to a particular task [23, 51]; meanwhile, domain-relevant knowledge is common across all tasks [6, 14, 25]. Work has been done that bridges the gap between these settings; for example, knowledge about a set of tasks could inform a curriculum over tasks [8]. In task-relevant learning, a leading method is to identify state information that is irrelevant to decision making and abstract it away [23, 51]. Our work falls into the same task-relevant category, where we are interested in learning responses to particular opponent policies. What differentiates our work from previous work is that we learn Q-values for each task independently, and do not ignore any information. Progressively growing neural networks is another similar line of work [41], focused on a stream of new tasks. Schwarz et al. [42] also found that network growth could be handled with policy distillation.
Transfer learning is the study of reusing knowledge to learn new tasks, domains, or policies. Within transfer learning, we can look at either how knowledge is transferred or what kind of knowledge is transferred. How to transfer knowledge has previously been studied in two main flavors [36, 27]. The representation transfer direction looks at how to abstract away general characteristics of a task that are likely to apply to later problems. Ammar et al. [1] present an algorithm where an agent collects a shared general set of knowledge that can be used for each particular task. The second direction directly transfers parameters across tasks, appropriately called parameter transfer. Taylor et al. [48] show how policies can be reused by creating a projection across different tasks' state and action spaces.

In the literature, transferring knowledge about the opponent's strategy is considered intra-agent transfer [43]. The focus of this area is on adapting to other agents. One line of work in this area focuses on ad hoc teamwork, where an agent must learn to quickly interact with new teammates [5, 4]. The main approach relies on already having a set of policies available, and learning to select which policy will work best with the new team [5]. Other work proposes learning features that are independent of the game, which can either be qualities general to all games or to strategies [2]. Our study differs from these in its focus on the opponent's policies as the source of information to transfer.
Conclusion

In this study we propose Q-Mixing, an algorithm for transferring knowledge across distributions of opponents. We show how Q-Mixing relies on the theoretical relationship between an agent's action-values and the strategy employed by the other agents. An empirical confirmation of the theory is first made on the grid-world soccer environment. In this environment we show how experience against pure strategies can transfer to mixed-strategy opponents. Moreover, we show that this transfer is able to cover the space of mixed strategies with no additional computation.

Next we tested our algorithm's robustness on a complicated cyber-security game. In this environment we first show the benefit of maintaining a belief about the opponent's policy, which is then used to update the weights of the respective BR Q-values. We also address the concern that a Q-Mixing policy may become too large or computationally expensive to use, demonstrating that policy distillation can compress a Q-Mixing policy into a much smaller parameter space.
References

[1] Haitham Bou Ammar, Eric Eaton, José Marcio Luna, and Paul Ruvolo. Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In IJCAI, pages 3345–3349, 2015.
[2] Bikramjit Banerjee and Peter Stone. General game learning using knowledge transfer. In IJCAI, pages 672–677, 2007.
[3] Nolan Bard, Michael Johanson, Neil Burch, and Michael Bowling. Online implicit agent modelling. In AAMAS, pages 255–262, 2013.
[4] Nolan Bard, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280, 2020.
[5] Samuel Barrett and Peter Stone. Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In AAAI, pages 2010–2016, 2015.
[6] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[7] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In AAAI, pages 746–752, 1998.
[8] Wojciech Czarnecki, Siddhant Jayakumar, Max Jaderberg, Leonard Hasenclever, Yee Whye Teh, Nicolas Heess, Simon Osindero, and Razvan Pascanu. Mix & match agent curricula for reinforcement learning. In ICML, pages 1087–1095, 2018.
[9] Karel Durkota, Viliam Lisý, Branislav Bošanský, and Christopher Kiekintveld. Optimal network security hardening using attack graph games. In IJCAI, pages 526–532, 2015.
[10] Karel Durkota, Viliam Lisý, Christopher Kiekintveld, Branislav Bošanský, and Michal Pechoucek. Case studies of network defense with attack graph games. IEEE Intelligent Systems, 31:24–30, 2016.
[11] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H. S. Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In ICML, pages 1146–1155, 2017.
[12] Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In AAMAS, pages 122–130, 2018.
[13] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI, pages 2974–2982, 2018.
[14] David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, 49(2):325–346, 2002.
[15] Amy Greenwald and Keith Hall. Correlated-Q learning. In ICML, pages 242–249, 2003.
[16] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In NeurIPS, pages 1523–1530, 2001.
[17] Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In AAMAS, pages 66–83, 2017.
[18] He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In ICML, pages 1804–1813, 2016.
[19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning and Representation Learning Workshop, 2014.
[20] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
[21] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
[22] Michael Johanson, Martin Zinkevich, and Michael Bowling. Computing robust counter-strategies. In Advances in Neural Information Processing Systems 20, 2007.
[23] Nicholas K. Jong and Peter Stone. State abstraction discovery from irrelevant state variables. In IJCAI, pages 752–757, 2005.
[24] Leslie Pack Kaelbling. Learning to achieve goals. In IJCAI, pages 1094–1099, 1993.
[25] George Konidaris and Andrew Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In ICML, pages 489–496, 2006.
[26] Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
[27] Andrew K. Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. In ICLR, 2019.
[28] Marc Lanctot, Vinicius Zambaldi, Audrūnas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In NeurIPS, pages 4193–4206, 2017.
[29] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In ICML, pages 157–163, 1994.
[30] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS, pages 6382–6393, 2017.
[31] Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. MAVEN: Multi-agent variational exploration. In NeurIPS, 2019.
[32] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In ICML, pages 536–543, 2003.
[33] Erik Miehling, Mohammad Rasouli, and Demosthenis Teneketzis. Optimal defense policies for partially observable spreading processes on Bayesian attack graphs. In Second ACM Workshop on Moving Target Defense (MTD), pages 67–76, 2015.
[34] Thanh H. Nguyen, Mason Wright, Michael P. Wellman, and Satinder Baveja. Multi-stage attack graph security games: Heuristic strategies, with empirical game-theoretic analysis. In Workshop on Moving Target Defense (MTD), pages 87–97, 2017.
[35] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In ICML, 2017.
[36] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[37] Cynthia Phillips and Laura Painton Swiler. A graph-based system for network-vulnerability analysis. In Workshop on New Security Paradigms (NSPW), pages 71–79, 1998.
[38] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, pages 4295–4304, 2018.
[39] Sebastian Ruder. An overview of multi-task learning in deep neural networks. CoRR, abs/1706.05098, 2017.
[40] Andrei A. Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In International Conference on Learning Representations (ICLR), 2016.
[41] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
[42] Jonathan Schwarz, Jelena Luketina, Wojciech M. Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In ICML, 2018.
[43] Felipe Silva and Anna Costa. A survey on transfer learning for multiagent reinforcement learning systems. Journal of Artificial Intelligence Research, 64, 2019.
[44] Matthijs Snel and Shimon Whiteson. Learning potential functions and their representations for multi-task reinforcement learning. In AAMAS, pages 637–681, 2014.
[45] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning. In AAMAS, 2018.
[46] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018.
[47] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In ICML, pages 330–337, 1993.
[48] Matthew E. Taylor, Peter Stone, and Yaxin Liu. Value functions for RL-based behavior transfer: A comparative study. In AAAI, pages 880–885, 2005.
[49] Gerald Tesauro. Extending Q-learning to general adaptive multi-agent systems. In NeurIPS, pages 871–878, 2003.
[50] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, pages 2094–2100, 2016.
[51] Thomas J. Walsh, Lihong Li, and Michael L. Littman. Transferring state abstractions between MDPs. In ICML-06 Workshop on Structural Knowledge Transfer for Machine Learning, 2006.
[52] Xiaofeng Wang and Tuomas Sandholm. Learning near-Pareto-optimal conventions in polynomial time. In NeurIPS, pages 863–870, 2003.
[53] Mason Wright, Yongzhao Wang, and Michael P. Wellman. Iterated deep reinforcement learning in games: History-aware training for improved stability. In ACM Conference on Economics and Computation (EC), pages 617–636, 2019.
[54] Tianpei Yang, Jianye Hao, Zhaopeng Meng, Chongjie Zhang, Yan Zheng, and Ze Zheng. Efficient detection and optimal response against sophisticated opponents. In IJCAI, 2019.
[55] Yan Zheng, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, and Changjie Fan. A deep Bayesian policy reuse approach against non-stationary agents. In NeurIPS, 2018.
A Experimental Details
Double DQN was used for all experiments [50]. Determining the correct hyperparameters is a non-trivial problem, because the learning dynamics may vary across opponent policies. To this end, we sought a method for selecting hyperparameters that performs well against a diversity of opponents, while remaining computationally tractable. We chose the hyperparameters that performed best against a uniform mixture of a fixed set of opponents (five for the soccer environment, and three for the cyber-security environment). The opponent policies were generated through PSRO and sampled from the resulting strategy sets. For both experiments we also added a random opponent to each strategy set, because we expect this to be one of the more challenging opponents to learn against.

However, there is a chicken-and-egg problem present: in order to run PSRO we would already need the aforementioned hyperparameters. As a stop-gap, we chose initial hyperparameters by evaluating their performance against a random opponent. In summary, for both environments we select hyperparameters by:

1. Sample 200 possible hyperparameter settings, and choose the one that performs best against a random opponent.
2. Run PSRO until it exceeds a three-day walltime.
3. Sample a fixed set of policies from the strategy set generated by PSRO.
4. Sample 200 hyperparameter settings, and evaluate them against the uniform mixed strategy over the four PSRO policies and the random opponent.

We chose to evaluate our hyperparameters against the mixed-strategy opponent because we believed it offered the most benefit to the baseline method. Future work could look at the interplay between the hyperparameter selection method and the respective performance of Q-Mixing and of learning a BR directly against a mixed strategy.

The non-standard hyperparameters listed throughout the appendix are defined as follows:
Timesteps: Total number of experiences collected during training.
Exploration Fraction: Fraction of the training timesteps used for exploration. The exploration policy is $\epsilon$-greedy, starting at $\epsilon = 1.0$ and linearly decaying to the Exploration Final $\epsilon$ value (see the sketch after this list).
Exploration Final $\epsilon$: The final $\epsilon$ value.
Training Frequency: Timestep frequency for performing updates.
Training Starts: Number of timesteps experienced before training begins.
Number of Simulations: The number of simulated episodes performed for evaluation.
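The exploration schedule implied by these hyperparameters can be written as follows. This is our illustration; the 0.05 final $\epsilon$ in the example is a placeholder value, not taken from the tables.

```python
def epsilon_at(t, total_timesteps, exploration_fraction, exploration_final):
    """Linearly decay epsilon from 1.0 to exploration_final over the first
    exploration_fraction * total_timesteps steps, then hold it constant."""
    decay_steps = exploration_fraction * total_timesteps
    frac = min(t / decay_steps, 1.0)
    return 1.0 + frac * (exploration_final - 1.0)

# Example: with 300000 timesteps and exploration fraction 0.33,
# epsilon reaches its final value at timestep 99000.
print(epsilon_at(0, 300_000, 0.33, 0.05))       # 1.0
print(epsilon_at(99_000, 300_000, 0.33, 0.05))  # 0.05
```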
A.1 Soccer

A.1.1 Policy & Environment
In this experiment we select five opponent policies and hold them fixed throughout the experiments. We consider the hyperparameters in Table 3 as candidates, and found the hyperparameters listed in Table 2 to perform best. To ensure that each method receives the same simulation budget, we allow each pure-strategy BR 60000 timesteps. An interesting future direction is investigating the trade-off between simulation budget and performance that exists between these methods.

Table 2: Soccer environment hyperparameters against a mixed strategy.

Optimizer: Adam
Learning Rate: 0.0003
Buffer Size: 3000
Gamma: 0.99
Timesteps: 300000
Batch Size: 64
Exploration Fraction: 0.33
Exploration Final $\epsilon$: –

The trained DQN was approximated using a two-hidden-layer neural network. The hidden layers each had 50 units and were fully connected with ReLU activations. The possible actions are moving in any of the four cardinal directions, or staying in place. The input is a vector of length 120, representing each of the 20 positions on the board having one of the following six states (a sketch of this encoding follows the list):

• Player 0 is on this square and does not have the ball.
• Player 0 is on this square and is holding the ball.
• Player 1 is on this square and does not have the ball.
• Player 1 is on this square and is holding the ball.
• The ball is on the ground on this square.
• Unoccupied space.
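The following is our own sketch of the 20 × 6 = 120 encoding described above; the exact feature ordering is not specified in the text, so a square-major layout is assumed.

```python
import numpy as np

N_SQUARES, N_STATES = 20, 6  # 6 mutually exclusive states per board square

def encode_observation(square_states):
    """square_states: length-20 list of state indices in [0, 6). Returns the
    length-120 one-hot observation vector described in the text."""
    obs = np.zeros(N_SQUARES * N_STATES)
    for square, state in enumerate(square_states):
        obs[square * N_STATES + state] = 1.0
    return obs

print(encode_observation([5] * 20).shape)  # (120,) -- an entirely empty board
```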
A.1.2 Opponent Policy Classifier
The hyperparameters selected for training the opponent classifier are listed in Table 4. The replay buffers gathered from training best responses against each opponent were merged into one dataset, and the classifier was trained to predict the opponent for each observation in the dataset. This resulted in 15000 data points, which were randomly split 90-10 between training and validation.

Table 4: Markov-Soccer opponent policy classifier hyperparameters.

Optimizer: Adam
Learning Rate: –
Loss: Cross Entropy
Batch Size: 64

The classifier was a neural network with the same architecture as a single policy, with the last layer modified to choose opponents rather than actions. We did not perform a hyperparameter search on this network or learning algorithm.
A.2 Cyber Security Game
In the cyber-security environment, two agents play an instance of the attack graph game. We choose the graph from Wright et al. [53] that was previously studied using empirical game-theoretic analysis and deep reinforcement learning. This graph is an Erdős–Rényi random graph with 30 nodes (including 6 goal nodes) and 100 edges. A visualization of the attack graph is presented in Figure 7.

Figure 7: The attack graph game environment. Goal nodes are represented using red dashed circles. Used with permission from Wright et al. [53].

The candidate hyperparameters are listed in Table 6, with the chosen hyperparameters in Table 5. Note that some hyperparameters differ between the attacker and defender. This is because the policies experience different observations, have different action spaces, and receive different rewards (the game is not zero-sum).

Table 5: Attack graph environment hyperparameters.

Optimizer: Adam
Learning Rate: –
Gradient Norm Clip: 10
Buffer Size: 30000
Gamma: 0.99
Batch Size: 32
Exploration Final $\epsilon$: –

Table 6: Attack graph candidate hyperparameters.

Learning Rate: –
Gradient Norm Clip: None, 0.1, 1, 10
Buffer Size: 10000, 30000, 50000, 70000
Batch Size: 32, 64
Timesteps: 25e4, 40e4, 50e4, 70e4, 100e4, 200e4
Exploration Fraction: 0.2, 0.3, 0.5, 0.7

Both the attacker and defender were modelled with a two-hidden-layer neural network, where each hidden layer had 256 units and was fully connected with tanh activations. The defender had an observation size of 240 and an action size of 31; similarly, the attacker had an observation size of 241 and an action size of 106. The attacker's state space was composed of the following features:

Active: Observation of the true state of the graph, which contains a bit for whether each node has been activated. 30 values.
Can Attack: A bit representing whether the preconditions for each and-node or or-edge have been met. 105 values.
In Attack Set: A bit for each and-node or or-edge representing whether it is currently in the action set. 105 values.
Time Steps Left: Number of timesteps left for action-set building. 1 value.

The defender's state space was composed of the following features:

Had Alert: Observation of the graph indicating which nodes had previously had an alert raised. This includes a history of the last three observations, because the defender only receives an observation of the graph and not the true graph state, resulting in 90 values.
Was Defended: For each node in Had Alert, whether the node was defended. 90 values.
In Defense Set: During action selection a set of nodes is selected; this feature auto-regressively indicates whether each node is currently in the defense set. 30 values.
Time Steps Left: Number of timesteps left for action-set building. 30 values. (To match the previous work, we continued the convention of copying this value for each node; this was originally done because it allowed convolution layers to be tested.)

The action space of this environment is combinatorial, because each agent can select any subset of the nodes/edges in the graph. Following Wright et al. [53], we used greedy action-set building to choose the players' actions. We refer the reader to their paper for extensive details.
A.3 Opponent Classifier
The hyperparameters considered for the cyber-security OPC are listed in Table 8, and the chosen hyperparameters in Table 7. As in the soccer environment, the replay buffers from training a best response to each opponent were merged into a single dataset. The classifier was trained to predict, given an observation from the dataset, which opponent was being played when the observation was encountered. The dataset of 90000 experiences was split 90-10 between training and validation. The validation set was used to select from 20 possible hyperparameter configurations.

Table 7: Cyber-security opponent policy classifier hyperparameters.

Optimizer: Adam
Learning Rate: –
Loss: Cross Entropy
Batch Size: 64

Table 8: Cyber-security opponent policy classifier considered hyperparameters.

Learning Rate: 0.0001
Batch Size: 32, 64, 128, 256
Hidden Layer Sizes: 50x2, 128x2, 200-100, 200x2, 200-128x2

The opponent classifier was a three-hidden-layer neural network. The classifier received the same observation as the defender (size 240), which was passed through fully connected layers of sizes 200, 128, and 128, respectively, with ReLU activations. The network is otherwise the same as the policy network, with the last layer modified to choose opponents rather than actions. We did not perform a further hyperparameter search on this network or learning algorithm.

A.4 Policy Distillation