Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games
Stephen McAleer∗
Department of Computer Science
University of California, Irvine
Irvine, CA
[email protected]

John Lanier∗
Department of Computer Science
University of California, Irvine
Irvine, CA
[email protected]

Roy Fox
Department of Computer Science
University of California, Irvine
Irvine, CA
[email protected]

Pierre Baldi
Department of Computer Science
University of California, Irvine
Irvine, CA
[email protected]
Abstract
Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of 10^50. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots.

1 Introduction

A long-standing goal in artificial intelligence and algorithmic game theory has been to develop a general algorithm which is capable of finding approximate Nash equilibria in large imperfect-information two-player zero-sum games. AlphaStar [Vinyals et al., 2019] and OpenAI Five [Berner et al., 2019] were able to demonstrate that variants of self-play reinforcement learning are capable of achieving expert-level performance in large imperfect-information video games. However, these methods are not principled from a game-theoretic point of view and are not guaranteed to converge to an approximate Nash equilibrium. Policy Space Response Oracles (PSRO) [Lanctot et al., 2017] is a game-theoretic reinforcement learning algorithm based on the Double Oracle algorithm and is guaranteed to converge to an approximate Nash equilibrium.

∗Authors contributed equally.
Preprint. Under review.

PSRO is a general, principled method for finding approximate Nash equilibria, but it may not scale to large games because it is a sequential algorithm that uses reinforcement learning to train a full best response at every iteration. Two existing approaches parallelize PSRO: Deep Cognitive Hierarchies (DCH) [Lanctot et al., 2017] and Rectified PSRO [Balduzzi et al., 2019], but both have counterexamples on which they fail to converge to an approximate Nash equilibrium, and as we show in our experiments, neither reliably converges in random normal form games.

Although DCH approximates PSRO, it has two main limitations. First, DCH needs the same number of parallel workers as the number of best response iterations that PSRO takes. For large games, this requires a very large number of parallel reinforcement learning workers. This also requires guessing how many iterations the algorithm will need before training starts. Second, DCH keeps training policies even after they have plateaued. This introduces variance by allowing the best responses of early levels to change each iteration, causing a ripple effect of instability.
We find that, in random normal form games, DCH rarely converges to an approximate Nash equilibrium even with a large number of parallel workers, unless their learning rate is carefully annealed.

Rectified PSRO is a variant of PSRO in which each learner only plays against other learners that it already beats. We prove by counterexample that Rectified PSRO is not guaranteed to converge to a Nash equilibrium. We also show that Rectified PSRO rarely converges in random normal form games.

In this paper we introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to scale up PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. P2SRO has two classes of policies: fixed and active. Active policies are trained in parallel while fixed policies are not trained anymore. Each parallel reinforcement learning worker trains an active policy in a hierarchical pipeline, training against the meta Nash equilibrium of both the fixed policies and the active policies on lower levels in the pipeline. Once the performance of the lowest-level active worker in the pipeline does not improve past a given threshold in a given amount of time, the policy becomes fixed, and a new active policy is added to the pipeline. P2SRO is guaranteed to converge to an approximate Nash equilibrium. Unlike Rectified PSRO and DCH, P2SRO converges to an approximate Nash equilibrium across a variety of imperfect information games such as Leduc poker and random normal form games.

We also introduce an open-source environment for Barrage Stratego, a variant of Stratego. Barrage Stratego is a large two-player zero-sum imperfect-information board game with an approximate game tree complexity of 10^50. We demonstrate that P2SRO is able to achieve state-of-the-art performance on Barrage Stratego, beating all existing bots.

To summarize, in this paper we provide the following contributions:

• We develop a method for parallelizing PSRO which is guaranteed to converge to an approximate Nash equilibrium, and show that this method outperforms existing methods on random normal form games and Leduc poker.

• We present theory analyzing the performance of PSRO, as well as a counterexample where Rectified PSRO does not converge to an approximate Nash equilibrium.

• We introduce an open-source environment for Stratego and Barrage Stratego, and demonstrate state-of-the-art performance of P2SRO on Barrage Stratego.
2 Background and Related Work

A two-player normal-form game is a tuple (Π, U), where Π = (Π_1, Π_2) is the set of policies (or strategies), one for each player, and U : Π → R^2 is a payoff table of utilities for each joint policy played by all players. For the game to be zero-sum, for any joint policy π ∈ Π, the payoff u_i(π) to player i must be the negative of the payoff u_{−i}(π) to the other player, denoted −i. Players try to maximize their own expected utility by sampling from a distribution over the policies σ_i ∈ Σ_i = ∆(Π_i). The set of best responses to a mixed policy σ_i is defined as the set of policies that maximally exploit the mixed policy: BR(σ_i) = arg min_{σ'_{−i} ∈ Σ_{−i}} u_i(σ'_{−i}, σ_i), where u_i(σ) = E_{π∼σ}[u_i(π)]. The exploitability of a pair of mixed policies σ is defined as EXPLOITABILITY(σ) = (u_2(σ_1, BR(σ_1)) + u_1(BR(σ_2), σ_2)) ≥ 0. A pair of mixed policies σ = (σ_1, σ_2) is a Nash equilibrium if EXPLOITABILITY(σ) = 0. An approximate Nash equilibrium at a given level of precision ε is a pair of mixed policies σ such that EXPLOITABILITY(σ) ≤ ε.

In small normal-form games, Nash equilibria can be found via linear programming [Nisan et al., 2007]. However, this quickly becomes infeasible when the size of the game increases. In large normal-form games, no-regret algorithms such as fictitious play, replicator dynamics, and regret matching can asymptotically find approximate Nash equilibria [Fudenberg et al., 1998, Taylor and Jonker, 1978, Zinkevich et al., 2008]. Extensive-form games extend normal-form games and allow for sequences of actions. Examples of perfect-information extensive-form games include chess and Go, and examples of imperfect-information extensive-form games include poker and Stratego.

In perfect-information extensive-form games, algorithms based on minimax tree search have had success on games such as checkers, chess, and Go [Silver et al., 2017]. Extensive-form fictitious play (XFP) [Heinrich et al., 2015] and counterfactual regret minimization (CFR) [Zinkevich et al., 2008] extend fictitious play and regret matching, respectively, to extensive-form games. In large imperfect-information games such as heads-up no-limit Texas Hold 'em, counterfactual regret minimization has been used on an abstracted version of the game to beat top humans [Brown and Sandholm, 2018]. However, this is not a general method because finding abstractions requires expert domain knowledge and cannot be easily done for different games. For very large imperfect-information games such as Barrage Stratego, it is not clear how to use abstractions and CFR. Deep CFR [Brown et al., 2019] is a general method that trains a neural network on a buffer of counterfactual values. However, Deep CFR uses external sampling, which may be impractical for games with a large branching factor such as Stratego and Barrage Stratego. Current Barrage Stratego bots are based on imperfect-information tree search and are unable to beat even intermediate-level human players [Schadd and Winands, 2009, Jug and Schadd, 2009].

Recently, deep reinforcement learning has proven effective on high-dimensional sequential decision-making problems such as Atari games and robotics [Li, 2017]. AlphaStar [Vinyals et al., 2019] beat top humans at Starcraft using self-play and population-based reinforcement learning.
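To make these definitions concrete, the following is a minimal sketch of our own (not from the paper) that computes best-response values and the exploitability of a strategy pair in a zero-sum normal-form game, given player 1's payoff matrix:

```python
import numpy as np

def best_response_value(G, sigma_opponent):
    """Value of an exact pure best response against a mixed strategy.

    G[a, b] is the payoff to the responding player for playing pure
    strategy a against the opponent's pure strategy b.
    """
    return float(np.max(G @ sigma_opponent))

def exploitability(G1, sigma1, sigma2):
    """EXPLOITABILITY(sigma) for a zero-sum game.

    G1[a, b] is player 1's payoff; player 2's payoff matrix is -G1.T.
    The sum of the two best-response values is 0 exactly at a Nash
    equilibrium and strictly positive otherwise.
    """
    br_vs_sigma2 = best_response_value(G1, sigma2)    # player 1 exploits sigma2
    br_vs_sigma1 = best_response_value(-G1.T, sigma1) # player 2 exploits sigma1
    return br_vs_sigma2 + br_vs_sigma1

# Matching pennies: uniform play is the Nash equilibrium.
G1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(exploitability(G1, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0
print(exploitability(G1, np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # 1.0
```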
Similarly, OpenAI Five [Berner et al., 2019] beat top humans at Dota 2 using self-play reinforcement learning. Similar population-based methods have achieved human-level performance on Capture the Flag [Jaderberg et al., 2019]. However, these algorithms are not guaranteed to converge to an approximate Nash equilibrium. Neural Fictitious Self Play (NFSP) [Heinrich and Silver, 2016] approximates extensive-form fictitious play by progressively training a best response against an average of all past policies using reinforcement learning. The average policy is represented by a neural network and is trained via supervised learning using a replay buffer of past best response actions. This replay buffer may become prohibitively large in complex games.

The Double Oracle algorithm [McMahan et al., 2003] is an algorithm for finding a Nash equilibrium in normal form games. The algorithm works by keeping a population of policies Π^t ⊂ Π at time t. Each iteration, a Nash equilibrium σ^{*,t} is computed for the game restricted to policies in Π^t. Then, a best response to this Nash equilibrium for each player, BR(σ^{*,t}_{−i}), is computed and added to the population: Π^{t+1}_i = Π^t_i ∪ {BR(σ^{*,t}_{−i})} for i ∈ {1, 2}.

Policy Space Response Oracles (PSRO) approximates the Double Oracle algorithm. The meta Nash equilibrium is computed on the empirical game matrix U^Π, given by having each policy in the population Π play each other policy and tracking average utility in a payoff matrix. In each iteration, an approximate best response to the current meta Nash equilibrium over the policies is computed via any reinforcement learning algorithm. In this work we use a discrete-action version of Soft Actor Critic (SAC), described in Section 3.1.

DCH [Lanctot et al., 2017] parallelizes PSRO by training multiple reinforcement learning agents, each against the meta Nash equilibrium of agents below it in the hierarchy. One problem with DCH is that one needs to set the number of workers equal to the number of policies in the final population beforehand. For large games such as Barrage Stratego, this might require hundreds of parallel workers. Also, in practice, DCH fails to converge in small random normal form games even with an exact best-response oracle and a learning rate of 1, because early levels may change their best response occasionally due to randomness in estimation of the meta Nash equilibrium. In our experiments, and in the DCH experiments in Lanctot et al. [2017], DCH is unable to achieve low exploitability on Leduc poker.

Figure 1: Pipeline PSRO. The lowest-level active policy π^j (blue) plays against the meta Nash equilibrium σ^{*,j} of the lower-level fixed policies in Π^f (gray). Each additional active policy (green) plays against the meta Nash equilibrium of the fixed and training policies in levels below it. Once the lowest active policy plateaus, it becomes fixed, a new active policy is added, and the next active policy becomes the lowest active policy. In the first iteration, the fixed population consists of a single random policy.

Another existing parallel PSRO algorithm is Rectified PSRO [Balduzzi et al., 2019]. Rectified PSRO assigns each learner to play against the policies that it currently beats. However, we prove that Rectified PSRO does not converge to a Nash equilibrium in all symmetric zero-sum games. In our experiments, Rectified PSRO rarely converges to an approximate Nash equilibrium in random normal form games.
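As an illustration of the Double Oracle loop that PSRO approximates, here is a minimal sketch of our own for the symmetric zero-sum matrix-game case with exact best responses; fictitious play stands in for the restricted Nash solver, and all function names are illustrative:

```python
import numpy as np

def fictitious_play(G, iters=2000):
    # Approximate Nash of a symmetric zero-sum matrix game: repeatedly
    # best-respond to the empirical average strategy; return the average.
    counts = np.ones(G.shape[0])
    for _ in range(iters):
        counts[np.argmax(G @ (counts / counts.sum()))] += 1
    return counts / counts.sum()

def double_oracle(G, start=0, max_iters=100):
    """Double Oracle on a symmetric zero-sum matrix game (one shared population)."""
    pop = [start]                            # restricted strategy set Π^t
    while len(pop) <= max_iters:
        sub = G[np.ix_(pop, pop)]
        sigma = fictitious_play(sub)         # restricted (meta) Nash σ^{*,t}
        mix = np.zeros(G.shape[0])
        mix[pop] = sigma
        br = int(np.argmax(G @ mix))         # exact best response BR(σ^{*,t})
        if br in pop:                        # no new strategy: σ is (approximately) a Nash
            return pop, mix
        pop.append(br)                       # Π^{t+1} = Π^t ∪ {BR(σ^{*,t})}
    return pop, mix
```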
3 Pipeline Policy Space Response Oracles (P2SRO)

Pipeline PSRO (P2SRO; Algorithm 1) is able to scale up PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning policies, each training against the policies in the lower levels of the hierarchy (Figure 1). P2SRO has two classes of policies: fixed and active. The fixed policies, denoted Π^f, do not train anymore, but remain in the fixed population. The parallel reinforcement learning workers train the active policies, denoted Π^a, in a hierarchical pipeline, training against the meta Nash equilibrium distribution of both the fixed policies and the active policies in levels below them in the pipeline. The entire population Π consists of the union of Π^f and Π^a. For each policy π^j_i in the active policies Π^a_i, to compute the distribution of policies to train against, a meta Nash equilibrium σ^{*,j}_{−i} is periodically computed on the policies lower than π^j_i, namely Π^f_{−i} ∪ {π^k_{−i} ∈ Π^a_{−i} | k < j}, and π^j_i trains against this distribution.

The performance of a policy π^j is given by its average performance during training, E_{π∼σ^{*,j}}[u(π, π^j)] + E_{π∼σ^{*,j}}[u(π^j, π)], against the meta Nash equilibrium distribution σ^{*,j}. Once the performance of the lowest-level active policy π^j in the pipeline does not improve past a given threshold in a given amount of time, we say that the policy's performance plateaus, and π^j becomes fixed and is added to the fixed population Π^f. Once π^j is added to the fixed population Π^f, then π^{j+1} becomes the new lowest active policy. A new policy is initialized and added as the highest-level policy in the active policies Π^a. Because the lowest-level policy only trains against the previous fixed policies Π^f, P2SRO maintains the same convergence guarantees as PSRO. Unlike PSRO, however, each policy in the pipeline above the lowest-level policy is able to get a head start by pre-training against the moving target of the meta Nash equilibrium of the policies below it.

Algorithm 1 Pipeline Policy-Space Response Oracles
Input:
Initial policy sets for all players Π^f
Compute expected utilities for empirical payoff matrix U^Π for each joint π ∈ Π
Compute meta Nash equilibrium σ^{*,j} over fixed policies (Π^f)
for many episodes do
  for all π^j ∈ Π^a in parallel do
    for player i ∈ {1, 2} do
      Sample π_{−i} ∼ σ^{*,j}_{−i}
      Train π^j_i against π_{−i}
    end for
    if π^j plateaus and π^j is the lowest active policy then
      Π^f = Π^f ∪ {π^j}
      Initialize new active policy at a higher level than all existing active policies
      Compute missing entries in U^Π from Π
      Compute meta Nash equilibrium for each active policy
    end if
    Periodically compute meta Nash equilibrium for each active policy
  end for
end for
Output current meta Nash equilibrium on whole population, σ^*

Unlike Rectified PSRO and DCH, P2SRO converges to an approximate Nash equilibrium across a variety of imperfect information games such as Leduc poker and random normal form games.

In our experiments we model the non-symmetric games of Leduc poker and Barrage Stratego as symmetric games by training one policy that can observe which player it is at the start of the game and play as either the first or the second player. We find that in practice it is more efficient to only train one population than to train two different populations, especially in larger games such as Barrage Stratego.

3.1 Implementation Details

For the meta Nash equilibrium solver we use fictitious play [Fudenberg et al., 1998]. Fictitious play is a simple method for finding an approximate Nash equilibrium in normal form games. Every iteration, a best response to the average strategy of the population is added to the population. The average strategy converges to an approximate Nash equilibrium. For the approximate best response oracle, we use a discrete version of Soft Actor Critic (SAC) [Haarnoja et al., 2018, Christodoulou, 2019]. We modify the version used in RLlib [Liang et al., 2018, Moritz et al., 2018] to account for discrete actions. We will release our code in an open-source repository.
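To make the control flow of Algorithm 1 concrete, below is a minimal sketch of our own for the symmetric matrix-game case, where a "policy" is a mixed strategy, "training" nudges it toward an exact best response, and fictitious play serves as the meta Nash solver as described above. The learning rate, plateau tolerance, and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def fictitious_play(G, iters=2000):
    # Approximate Nash of a symmetric zero-sum matrix game via fictitious play.
    counts = np.ones(G.shape[0])
    for _ in range(iters):
        counts[np.argmax(G @ (counts / counts.sum()))] += 1
    return counts / counts.sum()

def p2sro_sketch(G, num_active=3, lr=0.1, steps=25, tol=1e-3, rounds=100):
    n = G.shape[0]
    fixed = [np.eye(n)[rng.integers(n)]]          # single random fixed policy
    active = [np.ones(n) / n for _ in range(num_active)]
    for _ in range(rounds):
        for j in range(len(active)):              # run in parallel in real P2SRO
            lower = np.stack(fixed + active[:j])  # policies below level j
            meta = fictitious_play(lower @ G @ lower.T)
            target = meta @ lower                 # meta Nash mixture to train against
            br = np.eye(n)[np.argmax(G @ target)] # exact best response direction
            old = active[j].copy()
            for _ in range(steps):                # "train" toward the best response
                active[j] = lr * br + (1 - lr) * active[j]
            if j == 0 and np.abs(active[j] - old).max() < tol:
                fixed.append(active.pop(0))       # lowest active policy plateaued
                active.append(np.ones(n) / n)     # new highest-level active policy
                break
    pop = np.stack(fixed)
    meta = fictitious_play(pop @ G @ pop.T)
    return meta @ pop                             # approximate Nash of the full game
```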
PSRO is guaranteed to converge to an approximate Nash equilibrium and, unlike NFSP and Deep CFR, does not need a large replay buffer. In the worst case, all policies in the original game must be added before PSRO reaches an approximate Nash equilibrium. Empirically, on random normal form games, PSRO performs better than selecting pure strategies at random without replacement. This implies that in each iteration, PSRO is more likely than random to add a pure strategy that is part of the support of the Nash equilibrium of the full game, suggesting the conjecture that PSRO has a faster convergence rate than random strategy selection. The following theorem indirectly supports this conjecture.
Theorem 3.1.
Let σ be a Nash equilibrium of a symmetric normal form game (Π, U) and let Π_e be the set of pure strategies in its support. Let Π' ⊂ Π be a population that does not cover Π_e (Π_e ⊄ Π'), and let σ' be the meta Nash equilibrium of the original game restricted to strategies in Π'. Then there exists a pure strategy π ∈ Π_e \ Π' such that π does not lose to σ'.

Proof. Contained in supplementary material.

For example, in Rock–Paper–Scissors with Π' = {Rock}, the meta Nash equilibrium σ' plays Rock, and Paper ∈ Π_e \ Π' beats it.

Ideally, PSRO would be able to add a member of Π_e \ Π' to the current population Π' at each iteration. However, the best response to the current meta Nash equilibrium σ' is generally not a member of Π_e. Theorem 3.1 shows that for an approximate best response algorithm with a weaker guarantee of not losing to σ', it is possible that a member of Π_e \ Π' is added at each iteration.

Even assuming that a policy in the Nash equilibrium support is added at each iteration, the convergence of PSRO to an approximate Nash equilibrium can be slow because each policy is trained sequentially by a reinforcement learning algorithm. DCH, Rectified PSRO, and P2SRO are methods of speeding up PSRO through parallelization. In large games, many of the basic skills (such as extracting features from the board) may need to be relearned when starting each iteration from scratch. DCH and P2SRO are able to speed up PSRO by pre-training each level on the moving target of the meta Nash equilibrium of lower-level policies before those policies converge. This speedup would be linear in the number of parallel workers if each policy could train on the fixed final meta Nash equilibrium of the policies below it. Since it trains instead on a moving target, we expect the speedup to be sub-linear in the number of workers.

DCH is an approximation of PSRO that is not guaranteed to converge to an approximate Nash equilibrium if the number of levels is not equal to the number of pure strategies in the game, and is in fact guaranteed not to converge to an approximate Nash equilibrium if the number of levels cannot support it.

Another parallel PSRO algorithm, Rectified PSRO, is not guaranteed to converge to an approximate Nash equilibrium.
Rectified PSRO with an oracle best response does not converge to a Nash equilibrium in all symmetric two-player, zero-sum normal form games.
Proof. Consider the following symmetric two-player zero-sum normal form game, given by the row player's payoffs, where ε is a constant with 0 < ε < 1/2:

     0   −1    1   −ε
     1    0   −1   −ε
    −1    1    0   −ε
     ε    ε    ε    0

This game is based on Rock–Paper–Scissors, with an extra strategy added that beats all other strategies and is the pure Nash equilibrium of the game. Suppose the population of Rectified PSRO starts as the pure Rock strategy.

• Iteration 1: Rock ties with itself, so a best response to Rock (Paper) is added to the population.

• Iteration 2: The meta Nash equilibrium over Rock and Paper has all mass on Paper. The new strategy that gets added is the best response to Paper (Scissors).

• Iteration 3: The meta Nash equilibrium over Rock, Paper, and Scissors equally weights each of them. Now, for each of the three strategies, Rectified PSRO adds a best response to the meta-Nash-weighted combination of strategies that it beats or ties. Since Rock beats or ties Rock and Scissors, a best response to a 1/2-1/2 combination of Rock and Scissors is Rock, with an expected utility of 1/2. Similarly, for Paper, since Paper beats or ties Paper and Rock, a best response to a 1/2-1/2 combination of Paper and Rock is Paper. For Scissors, the best response for an equal mix of Scissors and Paper is Scissors. So in this iteration no strategy is added to the population and the algorithm terminates.

We see that the algorithm terminates without expanding the fourth strategy. The meta Nash equilibrium of the first three strategies that Rectified PSRO finds is not a Nash equilibrium of the full game, and is exploited by the fourth strategy, which is guaranteed to get a utility of ε against any mixture of them.

The pattern of the counterexample presented here can occur in large games, which suggests that Rectified PSRO may not be an effective algorithm for finding an approximate Nash equilibrium in large games. Prior work has found that Rectified PSRO does not converge to an approximate Nash equilibrium in Kuhn poker [Muller et al., 2020].

Figure 2: Exploitability of algorithms on (a) Leduc poker and (b) random symmetric normal form games.
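The terminating population and its exploitability can be checked numerically. Below is a small sketch of our own that reproduces iteration 3 for an assumed ε = 0.4; any value in (0, 1/2) gives the same outcome:

```python
import numpy as np

EPS = 0.4  # assumed value; the argument holds for any 0 < EPS < 1/2

# Row player payoffs: Rock, Paper, Scissors, and a fourth strategy X
# that beats each of the first three by EPS.
G = np.array([
    [0.0, -1.0,  1.0, -EPS],
    [1.0,  0.0, -1.0, -EPS],
    [-1.0, 1.0,  0.0, -EPS],
    [EPS,  EPS,  EPS,  0.0],
])

pop = [0, 1, 2]                          # population after iteration 2
meta = np.array([1/3, 1/3, 1/3])         # uniform is the Nash of restricted RPS

# Rectified step: each member best-responds to the meta-Nash-weighted
# mix of the population members it beats or ties.
for s in pop:
    beats_or_ties = [t for t in pop if G[s, t] >= 0]
    mix = np.zeros(4)
    mix[beats_or_ties] = meta[beats_or_ties] / meta[beats_or_ties].sum()
    br = int(np.argmax(G @ mix))
    print(f"strategy {s}: rectified best response is {br}")  # always s itself

# The returned meta Nash over {R, P, S} is exploited by X for utility EPS.
full = np.zeros(4)
full[pop] = meta
print("X's payoff against the returned meta Nash:", G[3] @ full)  # EPS
```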
Proposition 3.2.
P2SRO with an oracle best response converges to a Nash equilibrium in all two-player, zero-sum normal form games.
Proof. Since only the lowest active policy can be submitted to the fixed policies, this policy is an oracle best response to the meta Nash distribution of the fixed policies, making P2SRO with an oracle best response equivalent to the Double Oracle algorithm.

Unlike DCH, which becomes unstable when early levels change, P2SRO is able to avoid this problem because early levels become fixed once they plateau. While DCH only approximates PSRO, P2SRO has equivalent guarantees to PSRO because the lowest active policy always trains against a fixed meta Nash equilibrium before plateauing and becoming fixed itself. This fixed meta Nash distribution that it trains against is in principle the same as the one that PSRO would train against. The only difference between P2SRO and PSRO is that the extra workers in P2SRO are able to get a head start by pre-training on lower-level policies while those are still training. Therefore, P2SRO inherits the convergence guarantees of PSRO while scaling up when multiple processors are available.
4 Experiments

We compare P2SRO with DCH, Rectified PSRO, and a naive way of parallelizing PSRO that we term Naive PSRO. Naive PSRO is a way of parallelizing PSRO where each additional worker trains against the same meta Nash equilibrium of the fixed policies. Naive PSRO is beneficial when randomness in the reinforcement learning algorithm leads to a diversity of trained policies, and in our experiments it performs only slightly better than PSRO. Additionally, in random normal form game experiments, we include the original, non-parallel PSRO algorithm, termed sequential PSRO, and non-parallelized self-play, where a single policy trains against the latest policy in the population.

We find that DCH fails to reliably converge to an approximate Nash equilibrium across random symmetric normal form games and small poker games. We believe this is because early levels can randomly change even after they have plateaued, causing instability in higher levels. In our experiments, we analyze the behavior of DCH with a learning rate of 1 in random normal form games. We hypothesized that DCH with a learning rate of 1 would be equivalent to the Double Oracle algorithm and converge to an approximate Nash equilibrium. However, we found that the best response to a fixed set of lower levels can be different in each iteration due to randomness in calculating a meta Nash equilibrium. This causes a ripple effect of instability through the higher levels. We find that DCH almost never converges to an approximate Nash equilibrium in random normal form games.

Although not introduced in the original paper, we find that DCH converges to an approximate Nash equilibrium with an annealed learning rate. An annealed learning rate allows early levels to stop continually changing, so the variance of all of the levels can tend to zero. Reinforcement learning algorithms have been found to empirically converge to approximate Nash equilibria with annealed learning rates [Srinivasan et al., 2018, Bowling and Veloso, 2002].

Table 1: P2SRO results vs. existing bots

Name               P2SRO Win Rate vs. Bot
Asmodeus           81%
Celsius            70%
Vixen              69%
Celsius1.1         65%
All Bots Average   71%
We find that DCH with an annealed learning rate does converge to an approximate Nash equilibrium, but it can converge slowly depending on the rate of annealing. Furthermore, annealing the learning rate can be difficult to tune with deep reinforcement learning, and can slow down training considerably.
4.1 Random Symmetric Normal Form Games

For each experiment, we generate a random symmetric zero-sum normal form game of dimension n by generating a random antisymmetric matrix P. Each element in the upper triangle is distributed uniformly: ∀ i < j ≤ n, a_{i,j} ∼ UNIFORM(−1, 1). Every element in the lower triangle is set to be the negative of its mirrored counterpart across the diagonal: ∀ j < i ≤ n, a_{i,j} = −a_{j,i}. The diagonal elements are equal to zero: a_{i,i} = 0. The matrix defines the utility of two pure strategies to the row player. A strategy π ∈ ∆^n is a distribution over the n pure strategies of the game, given by the rows (or equivalently, columns) of the matrix. In these experiments we can easily compute an exact best response to a strategy and do not use reinforcement learning to update each strategy. Instead, as a strategy π "trains" against another strategy π̂, it is updated by a learning rate r multiplied by the best response to that strategy: π' = r BR(π̂) + (1 − r) π. A code sketch of this procedure is given at the end of this section.

Figure 2 shows results for each algorithm on random symmetric normal form games of dimension 60, about the same dimension as the normal form of Kuhn poker. We run each algorithm on five different random symmetric normal form games. We report the mean exploitability over time of these algorithms and add error bars corresponding to the standard error of the mean. P2SRO reaches an approximate Nash equilibrium much faster than the other algorithms. Additional experiments on different dimension games and different learning rates are included in the supplementary material. In each experiment, P2SRO converges to an approximate Nash equilibrium much faster than the other algorithms.

4.2 Leduc Poker

Leduc poker is played with a deck of six cards: two suits with three cards each. Each player bets one chip as an ante, then each player is dealt one card. After, there is a betting round, and then another card is dealt face up, followed by a second betting round. If a player's card is the same rank as the public card, they win. Otherwise, the player whose card has the higher rank wins. We run the following parallel PSRO algorithms on Leduc: P2SRO, DCH, Rectified PSRO, and Naive PSRO. We run each algorithm for three random seeds with three workers each. Results are shown in Figure 2. We find that P2SRO is much faster than the other algorithms, reaching 0.4 exploitability almost twice as soon as Naive PSRO. DCH and Rectified PSRO never reach a low exploitability.
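The following is a minimal sketch of our own of the Section 4.1 setup: generating the random antisymmetric payoff matrix and applying the learning-rate best-response update. The opponent distribution and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_symmetric_zero_sum_game(n):
    # Upper triangle ~ Uniform(-1, 1); lower triangle mirrored and negated,
    # zero diagonal, so P is antisymmetric: P[i, j] = -P[j, i].
    upper = np.triu(rng.uniform(-1.0, 1.0, size=(n, n)), k=1)
    return upper - upper.T

P = random_symmetric_zero_sum_game(60)   # the dimension used in the experiments

def train_step(P, pi, target, r=0.1):
    # pi' = r * BR(target) + (1 - r) * pi, with an exact best response.
    br = np.zeros(P.shape[0])
    br[np.argmax(P @ target)] = 1.0
    return r * br + (1 - r) * pi

pi = np.ones(60) / 60                    # uniform initial strategy
target = np.ones(60) / 60                # illustrative opponent distribution
for _ in range(100):
    pi = train_step(P, pi, target)
# pi has now moved most of its mass onto the best response to `target`.
```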
4.3 Barrage Stratego

Barrage Stratego is a smaller variant of the board game Stratego that is played competitively by humans. The board consists of a ten-by-ten grid with two two-by-two barriers in the middle. Initially, each player only knows the identity of their own eight pieces. At the beginning of the game, each player is allowed to place these pieces anywhere on the first four rows closest to them. More details about the game are included in the supplementary material.

We compare to all existing bots that are able to play Barrage Stratego [Moore, 2014]. These bots include: Vixen, Asmodeus, and Celsius. Other bots such as Probe and Master of the Flag exist, but can only play Stratego and not Barrage Stratego. We show results of P2SRO against the bots in Table 1. We find that P2SRO is able to beat these existing bots by 71% on average after , episodes, and has a win rate of at least 65% against each bot. We introduce an open-source gym environment for Stratego, Barrage Stratego, and smaller Stratego games at https://github.com/JBLanier/stratego_gym.

Broader Impact
Stratego and Barrage Stratego are very large imperfect-information board games played by many around the world. Although variants of self-play reinforcement learning have achieved grandmaster-level performance on video games, it is unclear if these algorithms could work on Barrage Stratego or Stratego because they are not principled and fail on smaller games. We believe that P2SRO will be able to achieve increasingly good performance on Barrage Stratego and Stratego as more time and compute are added to the algorithm. We are currently training P2SRO on Barrage Stratego and we hope that the research community will also take interest in beating top humans at these games as a challenge and inspiration for artificial intelligence research.

This research focuses on how to scale up algorithms for computing approximate Nash equilibria in large games. These methods are very compute-intensive when applied to large games. Naturally, this favors large tech companies or governments with enough resources to apply this method to large, complex domains, including real-life scenarios such as stock trading and e-commerce. It is hard to predict who might be put at an advantage or disadvantage as a result of this research, and it could be argued that powerful entities would gain by reducing their exploitability. However, the same players already do and will continue to benefit from information and computation gaps by exploiting suboptimal behavior of disadvantaged parties. It is our belief that, in the long run, preventing exploitability and striving as much as practical towards a provably efficient equilibrium can serve to level the field, protect the disadvantaged, and promote equity and fairness.
Acknowledgments and Disclosure of Funding
SM and PB were supported in part by grant NSF 1839429 to PB.
References
D. Balduzzi, M. Garnelo, Y. Bachrach, W. Czarnecki, J. Perolat, M. Jaderberg, and T. Graepel. Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning, pages 434–443, 2019.

C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

M. Bowling and M. Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2):215–250, 2002.

N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.

N. Brown, A. Lerer, S. Gross, and T. Sandholm. Deep counterfactual regret minimization. In International Conference on Machine Learning, pages 793–802, 2019.

P. Christodoulou. Soft actor-critic for discrete action settings. arXiv preprint arXiv:1910.07207, 2019.

D. Fudenberg and D. K. Levine. The Theory of Learning in Games. The MIT Press, 1998.

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870, 2018.

J. Heinrich and D. Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.

J. Heinrich, M. Lanctot, and D. Silver. Fictitious self-play in extensive-form games. In International Conference on Machine Learning, pages 805–813, 2015.

M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.

S. Jug and M. Schadd. The 3rd Stratego computer world championship. ICGA Journal, 32(4):233, 2009.

M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190–4203, 2017.

Y. Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica. RLlib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning, pages 3053–3062, 2018.

H. B. McMahan, G. J. Gordon, and A. Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 536–543, 2003.

S. Moore. Stratego AI evaluator. https://github.com/braathwaate/strategoevaluator, 2014.

P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al. Ray: A distributed framework for emerging AI applications. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 561–577, 2018.

P. Muller, S. Omidshafiei, M. Rowland, K. Tuyls, J. Perolat, S. Liu, D. Hennes, L. Marris, M. Lanctot, E. Hughes, et al. A generalized training approach for multiagent learning. In International Conference on Learning Representations (ICLR), 2020.

N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic Game Theory. Cambridge University Press, 2007.

M. Schadd and M. Winands. Quiescence search for Stratego. In Proceedings of the 21st Benelux Conference on Artificial Intelligence, Eindhoven, the Netherlands, 2009.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017. doi: 10.1038/nature24270.

S. Srinivasan, M. Lanctot, V. Zambaldi, J. Pérolat, K. Tuyls, R. Munos, and M. Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems, pages 3422–3435, 2018.

P. D. Taylor and L. B. Jonker. Evolutionary stable strategies and game dynamics. Mathematical Biosciences, 40(1-2):145–156, 1978.

O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pages 1729–1736, 2008.

Figure 3: Valid Barrage Stratego setup (note that the piece values are not visible to the other player)
A Proofs of Theorems
Theorem A.1.
Let σ be a Nash equilibrium of a symmetric normal form game (Π, U) and let Π_e be the set of pure strategies in its support. Let Π' ⊂ Π be a population that does not cover Π_e (Π_e ⊄ Π'), and let σ' be the meta Nash equilibrium of the original game restricted to strategies in Π'. Then there exists a pure strategy π ∈ Π_e \ Π' such that π does not lose to σ'.

Proof. σ' is a meta Nash equilibrium, implying σ'^T G σ' = 0, where G is the payoff matrix for the row player. In fact, each policy π in the support Π'_e of σ' has 1_π^T G σ' = 0, where 1_π is the one-hot encoding of π in Π.

Consider the sets Π^+ = {π : σ(π) > σ'(π)} = Π_e \ Π' and Π^− = {π : σ(π) < σ'(π)} ⊆ Π'_e. Note the assumption that Π^+ is not empty. If each π ∈ Π^+ had 1_π^T G σ' < 0, we would have

σ^T G σ' = (σ − σ')^T G σ'
         = Σ_{π ∈ Π^+} (σ(π) − σ'(π)) 1_π^T G σ' + Σ_{π ∈ Π^−} (σ(π) − σ'(π)) 1_π^T G σ'
         = Σ_{π ∈ Π^+} (σ(π) − σ'(π)) 1_π^T G σ' < 0,

in contradiction to σ being a Nash equilibrium. We conclude that there must exist π ∈ Π^+ with 1_π^T G σ' ≥ 0.
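As a quick sanity check of the theorem's claim (our own illustration, not from the paper), consider Rock–Paper–Scissors with the restricted population Π' = {Rock}:

```python
import numpy as np

# Row player payoffs for Rock-Paper-Scissors; the full-game Nash support
# Π_e is all three strategies.
G = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])

sigma_restricted = np.array([1.0, 0.0, 0.0])  # meta Nash when Π' = {Rock}

# Payoff of each pure strategy against σ'. Some strategy in Π_e \ Π'
# must not lose: Paper (index 1) wins, as the theorem guarantees.
print(G @ sigma_restricted)  # [ 0.  1. -1.]
```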
Barrage is a smaller variant of the board game Stratego that is played competitively by humans. The board consists of a ten-by-ten grid with two two-by-two barriers in the middle (see Figure 3). Each player has eight pieces, consisting of one Marshal, one General, one Miner, two Scouts, one Spy, one Bomb, and one Flag. Crucially, each player only knows the identity of their own pieces. At the beginning of the game, each player is allowed to place these pieces anywhere on the first four rows closest to them.

The Marshal, General, Spy, and Miner may move only one step to any adjacent space, but not diagonally. Bomb and Flag pieces cannot be moved. The Scout may move in a straight line like a rook in chess. A player can attack by moving a piece onto a square occupied by an opposing piece. Both players then reveal their piece's rank and the weaker piece gets removed. If the pieces are of equal rank then both get removed. The Marshal has higher rank than all other pieces; the General has higher rank than all other pieces besides the Marshal; the Miner has higher rank than the Scout, Spy, Flag, and Bomb; the Scout has higher rank than the Spy and Flag; and the Spy has higher rank than the Flag, as well as the Marshal when the Spy attacks the Marshal. Bombs cannot attack, but when any piece besides the Miner attacks a Bomb, the Bomb has higher rank. The player who captures their opponent's Flag or prevents the other player from moving any piece wins.