Finding and Certifying (Near-)Optimal Strategies in Black-Box Extensive-Form Games
Brian Hu Zhang, Tuomas Sandholm
Computer Science Department, Carnegie Mellon University
Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.
{bhzhang, sandholm}@cs.cmu.edu

Abstract
Often—for example in war games, strategy video games, and financial simulations—the game is given to us only as a black-box simulator in which we can play it. In these settings, since the game may have unknown nature action distributions (from which we can only obtain samples) and/or be too large to expand fully, it can be difficult to compute strategies with guarantees on exploitability. Recent work (Zhang and Sandholm 2020) resulted in a notion of certificate for extensive-form games that allows exploitability guarantees while not expanding the full game tree. However, that work assumed that the black box could sample or expand arbitrary nodes of the game tree at any time, and that a series of exact game solves (via, for example, linear programming) can be conducted to compute the certificate. Each of those two assumptions severely restricts the practical applicability of that method. In this work, we relax both of the assumptions. We show that high-probability certificates can be obtained with a black box that can do nothing more than play through games, using only a regret minimizer as a subroutine. As a bonus, we obtain an equilibrium-finding algorithm with Õ(√T) regret bound in the extensive-form game setting that does not rely on a sampling strategy with lower-bounded reach probabilities (which MCCFR assumes). We demonstrate experimentally that, in the black-box setting, our methods are able to provide nontrivial exploitability guarantees while expanding only a small fraction of the game tree.

Introduction

Computational equilibrium finding has led to many recent breakthroughs in AI in games such as poker (Bowling et al. 2015; Brown and Sandholm 2017, 2019b) where the game is fully known. However, in many applications, the game is not fully known; instead, it is given only via a simulator that permits an algorithm to play through the game repeatedly (e.g., Wellman 2006; Lanctot et al. 2017). The algorithm may never know the game exactly. While deep reinforcement learning has yielded strong practical results in this setting (Vinyals et al. 2019; Berner et al. 2019), those methods lack the low-exploitability guarantees of game-theoretic techniques, even with infinite samples and computation. Furthermore, the standard method of evaluating exploitability of
a strategy—computing the equilibrium gap of the strategy—is to compute a best response for each player. This, however, assumes the whole game to be known exactly.

Recently, Zhang and Sandholm (2020) defined a notion of certificate for imperfect-information extensive-form games that can address these problems. A certificate enables verification of the exploitability of a given strategy without exploring the whole game tree. However, that work has a few limitations that reduce its practical applicability. First, they assume a black-box model that allows sampling or expanding arbitrary nodes in the game tree. Yet most simulators only allow the players to start from the root of the game, and chance nodes in the simulator affect the path of play, so exploration by jumping around in the game tree is not supported. Second, their algorithm requires an exact game solver, for example, a linear program (LP) solver, to be invoked repeatedly as a subroutine. This reduces the ability of the algorithm to scale to cases in which LP is impractical due to run time or memory considerations.

In this paper, we address both of these concerns. We give algorithms that create certificates in extensive-form games in a simple black-box model, with either an exact game solver or a regret minimizer as a subroutine. We show that our algorithms achieve convergence rate O(√(log(T)/T)) (hiding game-dependent constants). This matches, up to a logarithmic factor, the convergence rate of regret minimizers such as counterfactual regret minimization (CFR) (Zinkevich et al. 2007; Brown and Sandholm 2019a) or its stochastic variant, Monte Carlo CFR (MCCFR) (Lanctot et al. 2009; Farina, Kroer, and Sandholm 2020)—while also providing verifiable equilibrium gap guarantees unlike those prior techniques. We prove that this convergence rate is optimal for the setting. We demonstrate experimentally that our method allows us to construct nontrivial certificates in games with good sample efficiency, namely, while taking fewer samples than there are nodes in the game. In contrast, the convergence guarantees of CFR and MCCFR are vacuous if the number of samples is smaller than the tree size.

As a side effect, we develop an algorithm for extensive-form game solving that enjoys many of the same properties of outcome-sampling MCCFR but works without the problematic assumption of having an a-priori uniformly-lower-bounded "sampling vector" that is required by MCCFR. Our techniques also work for games where payoffs can be received at internal nodes (not just at leaves), and for coarse-correlated equilibrium in general-sum multi-player games.

Preliminaries

We study extensive-form games, hereafter simply games. An extensive-form game consists of:
(1) a set of players P, usually identified with positive integers 1, 2, . . . , n. Nature, a.k.a. chance, will be referred to as player 0. For a given player i, we will often use −i to denote all players except i and nature.
(2) a finite tree H of nodes, rooted at some root node ∅. The edges connecting a node h to its children are labeled with actions. The set of actions at h will be denoted A(h). h ⪯ z means z is a descendant of h, or z = h.
(3) a map P : H → P ∪ {0}, where P(h) is the player who acts at node h (possibly nature).
(4) for each player i, a utility function u_i : H → R. It will be useful for us to allow players to gain utility at internal nodes of the game tree. Along any path (h_1, h_2, . . .
, h_k), define u(h_1 → h_k) = Σ_{i=1}^{k} u(h_i) to be the total utility gained along that path, including both endpoints. The goal of each player is to maximize their total reward u(∅ → z).
(5) for each player i, a partition of player i's decision points, i.e., P^{−1}(i), into information sets. In each information set I, every h ∈ I must have the same set of actions.
(6) for each node h at which nature acts, a distribution σ_0(·|h) over the actions available to nature at node h.

We will use (G, u), or simply G when the utility function is clear, to denote a game. G contains the tree and information set structure, and u = (u_1, . . . , u_n) is the profile of utility functions.

For any history h ∈ H and any player i ∈ P, the sequence s_i(h) of player i at node h is the sequence of information sets observed and actions taken by i on the path from ∅ to h. In this paper, all games will be assumed to have perfect recall: if h, h′ ∈ I and i acts at I, then s_i(h) = s_i(h′).

A behavior strategy (hereafter simply strategy) σ_i for player i is, for each information set I ∈ J_i at which player i acts, a distribution σ_i(·|I) over the actions available at that infoset. When an agent reaches information set I, it chooses action a with probability σ_i(a|I). A collection σ = (σ_1, . . . , σ_n) of behavior strategies, one for each player i ∈ P, is a strategy profile. A distribution over strategy profiles is called a correlated strategy profile, and will also be denoted σ. The reach probability σ_i(h) is the probability that node h will be reached, assuming that player i plays according to strategy σ_i, and all other players (including nature) always choose actions leading to h when possible. Analogously, we define σ(h) = Π_{i ∈ P ∪ {0}} σ_i(h) to be the probability that h is reached under strategy profile σ. This definition naturally extends to sets of nodes or to sequences by summing the reach probabilities of all relevant nodes.

Let S_i be the set of sequences for player i. The sequence form of a strategy σ_i is the vector x ∈ R^{S_i} given by x[s] = σ_i(s). The set of all sequence-form strategies is the sequence-form strategy space for i, and is a convex polytope (Koller, Megiddo, and von Stengel 1994).

The value of a profile σ for player i is u_i(σ) := E_{z∼σ} u_i(∅ → z). The future utility of a profile starting at h is u(σ|h); that is, u(σ|h) = E_{z∼σ|h} u(h → z). The best response value u*_i(σ_{−i}) for player i against an opponent strategy σ_{−i} is the largest achievable value; i.e., u*_i(σ_{−i}) = max_{σ_i} u_i(σ_i, σ_{−i}). A strategy σ_i is an ε-best response to opponent strategy σ_{−i} if u_i(σ_i, σ_{−i}) ≥ u*_i(σ_{−i}) − ε. A best response is a 0-best response.

A strategy profile σ is an ε-Nash equilibrium (which we will call ε-equilibrium for short) if all players are playing ε-best responses. A Nash equilibrium is a 0-Nash equilibrium.

We also study finding certifiably good strategies for the game-theoretic solution concept called coarse-correlated equilibrium. In such an equilibrium, if σ is correlated, the deviations σ_i when computing a best response are not allowed to depend on the shared randomness. A correlated strategy profile σ is a coarse-correlated ε-equilibrium if all players are playing ε-best responses under this restriction.
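To make the reach probabilities and sequence form defined above concrete, here is a minimal sketch. It is our own illustration rather than code from the paper; the `Node` fields and the `behavior` dictionary format are hypothetical choices of representation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    player: int                                   # 0 = chance, 1..n = acting player
    infoset: str = ""                             # information-set label of the acting player
    actions: dict = field(default_factory=dict)   # action label -> child Node

def reach_probabilities(root, behavior, player):
    """Return sigma_player(h) for every node h: the probability that `player`'s own
    choices allow h to be reached, with all other players (and nature) assumed to
    cooperate. `behavior` maps (infoset, action) -> sigma_player(action | infoset)
    and must cover every infoset where `player` acts."""
    reach = {}
    def walk(node, prob):
        reach[id(node)] = prob
        for action, child in node.actions.items():
            p = behavior.get((node.infoset, action), 0.0) if node.player == player else 1.0
            walk(child, prob * p)
    walk(root, 1.0)
    return reach
```

Multiplying these per-player quantities (including nature's, player 0) gives σ(h), and reading off σ_i(s) node by node recovers the sequence-form vector x[s] = σ_i(s).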
ε-equilibria within pseudogames

We now define pseudogames, first introduced by Zhang and Sandholm (2020).

Definition 2.1. A pseudogame (G̃, α, β) is a game in which some nodes do not have specified utility but rather have only lower and upper bounds on utilities. Formally, for each player i, instead of the standard utility function u_i, there are lower and upper bound functions α_i, β_i : H → R.

We will always use Δ to mean β − α.

Definition 2.2. (G̃, α, β) is a trunk of a game (G, u) if:
(1) G̃ can be created by collapsing some internal nodes of G into terminal nodes (and removing them from information sets they are contained in), and
(2) for all nodes h of G, all players i, and all strategy profiles σ, we have α_i(σ|h) ≤ u_i(σ|h) ≤ β_i(σ|h).

It is possible for information sets to be partially or totally removed in a trunk game. Next we state the basics of equilibrium and coarse-correlated equilibrium in pseudogames.

Definition 2.3. A (coarse-correlated) ε-equilibrium of (G̃, α, β) is a (correlated) profile σ such that the equilibrium gap β*_i(σ_{−i}) − α_i(σ) of each player i is at most ε.

Definition 2.4. A (coarse-correlated) ε-certificate for a game G is a pair (G̃, σ), where G̃ is a trunk of G and σ is a (coarse-correlated) ε-equilibrium of G̃.

Proposition 2.5 (Zhang and Sandholm 2020). Let (G̃, σ) be an ε-certificate for game G. Then any strategy profile in G created by playing according to σ in any information set appearing in G̃ and arbitrarily at information sets not appearing in G̃ is an ε-equilibrium in G.

Though the above proposition was stated only for Nash equilibrium by Zhang and Sandholm (2020), we observe that it applies to coarse-correlated equilibria as well.

The zero-sum case
A two-player game is zero-sum if u_2 = −u_1. In this case, we refer to a single utility function u; it is understood that Player 2's utility function is −u. In zero-sum games, all equilibria have the same expected value; this is called the value of the game, and we denote it by u*. In the zero-sum case, we use a slightly different notion of ε-equilibrium of a pseudogame, which will make the subsequent results tighter.

Definition 2.6. A two-player pseudogame (G̃, α, β) is zero-sum if α_1 = −β_2 and β_1 = −α_2.

As alluded to above, in this situation, we will drop the subscripts, and write α and β to mean α_1 and β_1. In particular, (G̃, α) and (G̃, β) are zero-sum games.

Definition 2.7. An ε-equilibrium of a two-player zero-sum pseudogame (G̃, α, β) is a profile (x*, y*) for which the Nash gap β*(y*) − α*(x*) is at most ε.

In zero-sum games, we need not concern ourselves with correlation, since at least one player can always deviate to playing independently of the other player and not lose utility. In particular, a coarse-correlated ε-equilibrium remains an ε-equilibrium if the correlations are removed.

Online convex optimization

Online convex optimization (OCO) (Zinkevich 2003) is a rich framework through which to understand decision-making in possibly adversarial environments.
Definition 2.8.
Let X ⊆ R^n be a compact, convex set. A regret minimizer A_X on X is an algorithm that acts as follows. At each time t = 1, 2, . . . , T, the algorithm A_X outputs a decision x^t ∈ X, and then receives a linear loss ℓ^t : X → R, which may be generated adversarially. The goal is to minimize the cumulative regret

R_T := max_{x ∈ X} Σ_{t=1}^{T} [ ℓ^t(x^t) − ℓ^t(x) ].

For example, CFR and its modern variants achieve O(√T) regret in sequence-form strategy spaces.

The connection between OCO and equilibrium finding in games is via the following observation. Let (σ^1, . . . , σ^T) be any sequence of strategy profiles, and let σ̄ be the correlated strategy profile that is uniform over σ^1, . . . , σ^T. Suppose that player i generated her strategy at each time t via a regret minimizer, and achieved regret R_T. Then, by definition of regret, i is playing an ε-best response to σ̄, where ε = R_T/T. Thus, in particular, if all players are playing using a regret minimizer with sublinear regret, the average strategy profile σ̄ converges to a coarse-correlated equilibrium.

Let (G, u) be an n-player game, which we assume to be given to us as a black box. Given a profile σ, the black box allows us to sample a playthrough from G under σ. We also assume that, at every node h, we are given correct (but not necessarily tight) bounds [α(h → ∗), β(h → ∗)] on the utility u(h → z) of every terminal descendant z ⪰ h; that is,

α(h → ∗) ≤ min_{z ⪰ h} u(h → z) ≤ max_{z ⪰ h} u(h → z) ≤ β(h → ∗).

Our goal in this paper is to develop equilibrium-finding algorithms that give anytime, high-probability, instance-specific exploitability guarantees that can be computed without expanding the rest of the game tree, and are better than the generic guarantees given by the worst-case runtime bounds of algorithms like MCCFR. More formally, our goal is, after t playthroughs, to efficiently maintain a strategy profile σ^t and bounds ε_{i,t} on the equilibrium gap of each player's strategy (or, in the zero-sum case, a single bound ε_t on the Nash gap) that are correct with probability 1 − 1/poly(t).

Before proceeding to algorithms, we prove a lower bound on the sample complexity of computing such a strategy profile. Let γ > 0 be arbitrary. Consider a multi-armed bandit instance in which the left arm has some unknown reward distribution over {0, 1}, and the right arm always gives utility 1/2. Let p be the probability that the left arm gives 1. We will consider the two games, G⁻ and G⁺, in which, respectively, the left arm gives p = 1/2 − ε and p = 1/2 + ε, where ε = Θ(√(γ log(t)/t)), and the Θ hides only an absolute constant. Suppose t samples of the left arm are taken (the right arm does not need to be sampled). We will say that the algorithm has selected the correct arm if σ^t assigns a higher probability to the better arm than it does to the worse arm. Then the following two facts are simultaneously true.
(1) By binomial tail bounds, no algorithm can select the correct arm with probability better than 1 − Θ(1/t^γ).
(2) In the event that an algorithm fails to select the correct arm at time t, its equilibrium gap is Θ(ε).
Thus, we have the following theorem.
Theorem 4.1. Any algorithm that provides the guarantees described in Section 3 must have ε_{i,t} = Ω(√(log(t)/t)).

We will now describe algorithms matching this bound.
We now describe our main theoretical construction: a notion of confidence sequence for games that enables us to construct high-probability certificates from playthroughs.
Definition 5.1. A confidence sequence for a game G is a sequence of pseudogames (Ĝ^t, α̂^t, β̂^t) created by the following protocol. Start with Ĝ^0 containing only one node and trivial reward bounds. At each time t:
(1) Query an exploration policy A to obtain a profile σ^t.
(2) Play a single game of G according to σ^t.
(3) Create Ĝ^t from Ĝ^{t−1} as follows.
  (a) Expand all nodes on the path of play. (It is also valid to expand only the first new node on the path of play; that does not change any of our theoretical results.)
  (b) For each chance node h in Ĝ^t:
    (i) If h was encountered on the path of play, update σ̂^t(a|h) according to the action observed at h.
    (ii) Let

      ρ(h) = √( (|A_h| log 2 + log(t · C_t · n)) / t_h ),        (5.2)

    where t_h is the number of times h has been sampled (including on this iteration), and C_t is the number of chance nodes in Ĝ^t. Now set β̂^t_i(h) = u_i(h) + ρ(h) Δ(h → ∗), and α̂^t_i(h) = u_i(h) − ρ(h) Δ(h → ∗).

We will use (G^t, α^t, β^t) to denote the pseudogame with the same game tree as Ĝ^t, but with the exact correct nature probabilities (that is, no sampling error, and ρ(h) = 0).

Theorem 5.3 (Correctness). For any time t, with probability at least 1 − 1/t, for every profile σ and player i, we have α̂^t_i(σ) ≤ α_i(σ) ≤ β_i(σ) ≤ β̂^t_i(σ).

Proofs are in the appendix. In this case, we will call the sequence correct at time t. These probabilities can be strengthened to any inverse polynomial function of t by replacing t in Equation (5.2) with a suitably larger polynomial.

Extra domain-specific information about the chance distributions can easily be incorporated into the bounds. For example, if two chance nodes are known to have the same action distribution, their samples can be merged. If the distribution of a chance node is known exactly, no sampling is necessary at all, and the number of chance nodes C_t in Equation (5.2) may be decremented accordingly.
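A minimal sketch of step (3b), in our own notation (the field names `counts`, `t_h`, `sigma_hat`, `utility`, `delta_to_leaves`, and `num_actions` are hypothetical; the constant inside the logarithm follows Equation (5.2) as printed above):

```python
import math

def confidence_radius(t_h, num_actions, t, num_chance_nodes, num_players):
    """rho(h) from Equation (5.2); capped at 1 because a radius of 1 already
    spans the whole reward range Delta(h -> *)."""
    if t_h == 0:
        return 1.0
    inside = num_actions * math.log(2) + math.log(t * num_chance_nodes * num_players)
    return min(1.0, math.sqrt(inside / t_h))

def rebound_chance_node(node, observed_action, t, num_chance_nodes, num_players):
    """Step (3b) for a chance node on the path of play: update the empirical
    nature distribution sigma_hat(.|h), then widen the utility bounds to
    u(h) +/- rho(h) * Delta(h -> *)."""
    node.counts[observed_action] = node.counts.get(observed_action, 0) + 1
    node.t_h += 1
    node.sigma_hat = {a: c / node.t_h for a, c in node.counts.items()}
    rho = confidence_radius(node.t_h, node.num_actions, t, num_chance_nodes, num_players)
    node.beta_hat = node.utility + rho * node.delta_to_leaves
    node.alpha_hat = node.utility - rho * node.delta_to_leaves
```

Note that ρ(h) depends on the global time t and on C_t, so step (3b)(ii) recomputes the bounds at every chance node of Ĝ^t each iteration, not only at the node just sampled; the empirical-distribution update in (3b)(i) applies only to nodes on the path of play.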
Definition 5.4. For an exploration policy A creating a confidence sequence (Ĝ^t, α̂^t, β̂^t), the cumulative uncertainty U_{i,T} for player i after the first T iterations is given by

U_{i,T} := Σ_{t=1}^{T} Δ̂^t_i(σ^t).

This can be thought of as the regret of an online optimizer that plays σ^t at time t, and then observes loss β̂^t_i − α̂^t_i. In a sense, the next result is the main theorem of our paper, and we find it the most surprising result of the paper. All our convergence guarantees stated later in the paper rely on it.
Theorem 5.5. Suppose that the true rewards are bounded in [0, 1]. Then for all times T, all players i, and any exploration policy A, we have

E U_{i,T} ≤ C_T √(T M) + N_T,

where N_T is the number of total nodes in Ĝ^T,

M = max over chance nodes h of (|A_h| log 2 + log(2 T C_T n)),

and the expectation is over the sampling of games and (if applicable) the randomness of A.

M is a constant that depends on the final pseudogame Ĝ^T. Importantly, it does not depend on the actual game G! This makes it possible for our approach to give meaningful exploitability guarantees while not exploring the full game. For fixed underlying game and confidence, M increases as Θ(log T), and hence U_{i,T} increases as O(√(T log T)).

The above discussion leads naturally to algorithms that generate certificates, which we will now discuss.
Algorithm 6.1: Two-player zero-sum certificate finding

  Input: black-box zero-sum game
  Initialize confidence sequence (Ĝ^0, α̂^0, β̂^0)
  for t = 1, 2, . . . do
      Solve (Ĝ^{t−1}, α̂^{t−1}) and (Ĝ^{t−1}, β̂^{t−1}) exactly to obtain equilibria (x^{t−1}, y^{t−1}) and (x̄^{t−1}, ȳ^{t−1}), respectively
      Create the next pseudogame Ĝ^t by sampling one playthrough according to some profile σ^t
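A sketch of this loop with hypothetical interfaces (our own Python rendering, not the authors' implementation; `solve_zero_sum`, `black_box_play`, and the pseudogame methods are stand-ins for an LP solver and the simulator):

```python
def certify_zero_sum(black_box_play, solve_zero_sum, initial_pseudogame, num_iterations):
    """Sketch of Algorithm 6.1. solve_zero_sum(game) is assumed to return an exact
    equilibrium profile together with the value of the game."""
    pseudogame = initial_pseudogame()            # one node, trivial reward bounds
    best_gap, best_profile = float("inf"), None
    for t in range(1, num_iterations + 1):
        # Solve the pessimistic (alpha_hat) and optimistic (beta_hat) pseudogames exactly.
        pess_profile, alpha_star = solve_zero_sum(pseudogame.lower_bound_game())
        opt_profile, beta_star = solve_zero_sum(pseudogame.upper_bound_game())
        gap = beta_star - alpha_star             # Nash gap bound eps_t (Definition 6.2)
        if gap < best_gap:                       # the pessimistic profile is the certificate
            best_gap, best_profile = gap, pess_profile      # Proposition 6.3
        # Optimistic exploration (Definition 6.4): one playthrough of the real game.
        path = black_box_play(opt_profile)
        pseudogame = pseudogame.updated_with(path, t)       # Definition 5.1, step (3)
    return best_profile, best_gap
```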
Definition 6.2. The Nash gap bound ε_t at time t of Algorithm 6.1 is ε_t = β̂*_t − α̂*_t, where β̂*_t and α̂*_t are the values of the optimistic and pessimistic pseudogames, respectively.
Proposition 6.3. Assuming that the confidence sequence is correct at time t, the pessimistic equilibrium (x^t, y^t) computed by Algorithm 6.1 is an ε_t-equilibrium of G^t.

This allows us to know (with high probability) when we have found an ε-equilibrium, without expanding the remainder of the game tree, even in the case when chance's strategy is not directly observable. The choice of exploration policy in Line 5 (the playthrough-sampling step) is very important. We now discuss that.
Definition 6.4. The optimistic exploration policy is σ^t = (x̄^{t−1}, ȳ^{t−1}); that is, both players explore according to the optimistic pseudogame.
Proposition 6.5. Under the optimistic policy, ε_t ≤ Δ̂^t(σ^t).

Together with Theorem 5.5, this immediately gives us a convergence bound on Algorithm 6.1:
Corollary 6.6.
Suppose we use optimistic exploration, and the true game G has rewards bounded in [0, 1]. Let ε*_T be the known bound on the Nash gap of the best pessimistic equilibrium found so far; that is, ε*_T = min_{t ≤ T} ε_t. Then

E ε*_T ≤ C_T √(M/T) + N_T / T.

(Indeed, ε*_T ≤ (1/T) Σ_{t ≤ T} ε_t ≤ (1/T) Σ_{t ≤ T} Δ̂^t(σ^t) = U_T / T by Proposition 6.5, and Theorem 5.5 bounds E U_T.)

This is not the same kind of bound that is achieved by MCCFR and related algorithms. Those algorithms guarantee an upper bound on exploitability as a function of total runtime; here, we bound the number of samples. After every sample, our Algorithm 6.1 solves the entire pseudogame generated so far. This may be expensive (though, since the game solves can be implemented as LP solves with warm starts from the previous iteration, in practice they are still reasonably efficient). However, as in Zhang and Sandholm (2020), our convergence guarantee has the distinct advantage of being dependent only on the current pseudogame, not the underlying full game. In this setting, the guarantee of regret minimization algorithms such as MCCFR would be vacuous until the total time exceeds the number of sequences in the full game. Furthermore, as the experiments later in this paper show, in practice, ε*_T is usually significantly smaller than its worst-case bound.

In several special cases, Algorithm 6.1 corresponds naturally to known algorithms and results.
(1) Perfect information and deterministic: Assuming the game solves return pure strategies (which is always possible here), Algorithm 6.1 is exactly the same as Algorithm 6.7 of Zhang and Sandholm (2020). In particular, in the two-player case, it is equivalent to incremental alpha-beta search; in the one-player case, it is equivalent to A* search (Hart, Nilsson, and Raphael 1968), where the upper bound β(h → ∗) corresponds to the heuristic lower bound on the total distance from the root to the goal.
(2) Nature probabilities known: Algorithm 6.1 is very similar (but not identical, due to the simpler black-box model) to Algorithm 6.7 of Zhang and Sandholm (2020).
(3) Multi-armed stochastic bandit: Algorithm 6.1 is, up to a constant factor in Equation (5.2), equivalent to UCB1 (Auer, Cesa-Bianchi, and Fischer 2002), and Corollary 6.6 matches the worst-case O(√(T log T)) dependence on T in the regret bound of UCB1. The worse dependence on the number of arms can be remedied by a more detailed analysis, which we skip here.

In practice, due to the computational cost of the game solves, we recommend running several samples per game solve. This enhances computational efficiency in domains where the game is not prohibitively large for LPs, or samples are relatively fast to obtain.

A major weakness of Algorithm 6.1 is its reliance on an exact game solver as a subroutine, which can be slow or even infeasible computationally. Could we replace the exact solver with a single iteration of some iterative game solver, and still maintain the Õ(1/√T) convergence rate? In this section we show how to do this with regret minimizers.

We now define a class of regret minimizers, which we coin extendable, which we can use to achieve the goal mentioned above. Intuitively, for an extendable family of regret minimizers, expanding a leaf of the pseudogame does not change the behavior or regret of the regret minimizer, so long as the past losses do not depend on the actions taken at the new information set, which is always the case with our algorithms because they have never visited the new information set. Thus, when working with an extendable family A, it makes sense to speak about "running A on a game G", even if information sets may be added to G over time. We will exploit this language. For example, CFR (thus also MCCFR, since it is nothing but CFR with stochastic gradient estimates (Farina, Kroer, and Sandholm 2020)) is an extendable family. In this case, the function φ described below simply initializes regrets at the new information set to 0.

Formally, let L(X) be the set of linear functions on X. Consider a regret minimizer A_X on X. We will think of A_X as maintaining a state s^t ∈ S_X. At any time t, the algorithm outputs strategy x^t ← x_X(s^t) for some map x_X : S_X → X, and after observing loss ℓ^t, the algorithm updates the state via s^{t+1} ← u_X(s^t, ℓ^t), where u_X : S_X × L(X) → S_X is an update function. As such, A_X can be thought of as a pair (x_X, u_X). For example, when X is the n-simplex and A_X is regret matching (Hart and Mas-Colell 2000), S_X is R^n, the update function adds the instantaneous regret vector, u_X(s^t, ℓ^t) = s^t − ℓ^t + ⟨ℓ^t, x_X(s^t)⟩ · 1, and the strategy plays each action i with probability x_X(s^t)(i) ∝ [s^t(i)]_+. A minimal sketch of this view appears below.
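The following is our own sketch of regret matching on the simplex in the (state, x_X, u_X) form just described; it covers only the simplex case, not the full sequence-form construction used by CFR:

```python
import numpy as np

class RegretMatching:
    """Regret matching (Hart and Mas-Colell 2000) on the n-simplex, phrased as a
    state machine: the state s_t is the vector of cumulative regrets."""
    def __init__(self, n):
        self.state = np.zeros(n)          # s_t: cumulative regret of each action

    def strategy(self):                   # x_X(s_t): proportional to positive regret
        pos = np.maximum(self.state, 0.0)
        total = pos.sum()
        n = len(self.state)
        return pos / total if total > 0 else np.full(n, 1.0 / n)

    def observe_loss(self, loss):         # u_X(s_t, l_t): add the instantaneous regrets
        loss = np.asarray(loss, dtype=float)
        played = float(loss @ self.strategy())
        self.state += played - loss       # action a gains regret when l_t(a) < l_t(x_t)
```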
Definition 7.1. Let A = {A_X} be a family of regret minimizers, one for each extensive-form strategy space X. A is extendable if for every X and every X′ ⊆ X × R^m formed by adding a decision point (with m actions) to X, there is a function φ : S_X → S_{X′} such that for every state s ∈ S_X:
(1) x_{X′}(φ(s)) agrees with x_X(s) on X, and
(2) for every loss function ℓ ∈ L(X), we have φ(u_X(s, ℓ)) = u_{X′}(φ(s), (ℓ, 0)), where (ℓ, 0) ∈ L(X′) = L(X) × L(R^m).
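For CFR-style regret minimizers, whose state is a per-infoset table of regrets, the map φ of Definition 7.1 simply pads the state with zeros for the newly added decision point. A minimal sketch (our own illustration):

```python
def extend_state(cfr_regrets, new_infoset, num_actions):
    """phi : S_X -> S_X' for a CFR-style state: a newly expanded information set
    starts with all-zero regrets. Past losses never touched the new infoset, so the
    strategy at every old infoset is unchanged, as required by Definition 7.1."""
    extended = dict(cfr_regrets)                    # copy the old per-infoset regret tables
    extended[new_infoset] = [0.0] * num_actions     # new decision point: zero regrets
    return extended
```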
Algorithm 7.2: Certificate-finding with regret minimization

  Input: black-box game, extendable family A_i for each player i
  Initialize confidence sequence (Ĝ^0, α̂^0, β̂^0)
  for t = 1, 2, . . . do
      Query each A_i to obtain a strategy σ^t_i
      Submit loss −β̂^t_i(·, σ^t_{−i}) to A_i
      Create the next pseudogame Ĝ^t by sampling one playthrough according to σ^t

Even in the two-player zero-sum case, this algorithm is not the exact generalization of Algorithm 6.1. That generalization would involve independently solving the lower- and upper-bound games (Ĝ^t, α̂^t) and (Ĝ^t, β̂^t) using a total of four regret minimizers, not two. This algorithm has no need to store or refer to pessimistic strategies. It suffices to use only the optimistic strategy. As usual when dealing with regret minimization, we will discuss convergence of the average (optimistic) strategy played by each player.
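A sketch of the loop of Algorithm 7.2 with hypothetical interfaces (the regret minimizers are assumed to expose the extendable-family operations of Definition 7.1; `black_box_play` is the simulator and the pseudogame methods are stand-ins, not a real API):

```python
def certify_with_regret_minimization(black_box_play, minimizers, pseudogame, num_iterations):
    """Sketch of Algorithm 7.2. minimizers[i] is an extendable regret minimizer for
    player i on the current pseudogame; pseudogame carries the confidence bounds."""
    for t in range(1, num_iterations + 1):
        # Query each regret minimizer for its current strategy sigma^t_i.
        profile = [m.strategy() for m in minimizers]
        # Submit the loss -beta_hat^t_i(. , sigma^t_{-i}): the negated optimistic utility
        # of each of player i's strategies against the opponents' current strategies.
        for i, m in enumerate(minimizers):
            m.observe_loss(pseudogame.negated_optimistic_utility_gradient(i, profile))
        # One playthrough of the real game; expand and re-bound the pseudogame.
        path = black_box_play(profile)
        pseudogame = pseudogame.updated_with(path, t)
        # phi of Definition 7.1: pad each minimizer's state for newly expanded infosets.
        for i, m in enumerate(minimizers):
            m.extend_to(pseudogame.infosets(i))
    return pseudogame
```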
Proposition 7.3. Suppose that the true rewards are bounded in [0, 1]. After t iterations of the for loop on Line 3, assuming the correctness of the confidence sequence at time t, the average optimistic profile σ̄^t forms a coarse-correlated approximate equilibrium of G^t, in which the equilibrium gap for player i is at most ε_{i,t} = β̂^{t*}_i(σ̄^t_{−i}) − α̂^t_i(σ̄^t).

Thus, Algorithm 7.2 is an anytime algorithm whose equilibrium gap bound at any time t can be easily computed by linear passes through the pseudogame Ĝ^t. In the two-player zero-sum case (wherein, for notation, β = β_1 and α = −β_2, and σ̄^t = (x̄^t, ȳ^t)), we can use the slightly tighter ε_t = β̂*_t(ȳ^t) − α̂*_t(x̄^t) as a Nash gap bound.

Figure 1: Convergence of Algorithm 6.1 and Algorithm 7.2 in 4-rank Goofspiel and 13-card limit Leduc. To be consistent with the other algorithms, one "iteration" of MCCFR consists of one accepted loss vector per player. For the other algorithms, one "iteration" is one playthrough. In all cases, we show both the provable equilibrium gap β̂*_t(σ^t) − α̂*_t(σ^t) and the true equilibrium gap β*_t(σ^t) − α*_t(σ^t). The exception is MCCFR, which on its own does not give provable equilibrium gaps in the same way. The horizontal lines are at the reward ranges of the two games, and the vertical lines are at the number of nodes in the game (Goofspiel has 54,421 nodes and 13-rank Leduc has 166,366).

Table 1: Algorithms we suggest by use case in two-player zero-sum games. Sampling-limited means that the black-box game simulator is relatively slow or expensive compared to solving the pseudogames. Compute-limited means that the simulator is fast or cheap compared to solving the pseudogames. In general-sum games, only Algorithm 7.2 is usable.

                                | Sampling-limited              | Compute-limited
  Unknown nature distributions  | Algorithm 6.1 with LP solver  | Algorithm 7.2 with a CFR variant (e.g., outcome-sampling MCCFR)
  Known nature distributions    | Algorithm 6.7 of Zhang and Sandholm (2020)

Annoyingly, it is not the case in general that ε_{i,t} = Õ(N_t/√t). Appendix B.1 provides a counterexample. Intuitively, the reason is that, for a fixed strategy σ, the upper bound β̂^t(σ) is not a monotonically nonincreasing function of t; indeed, for strategies σ that are not sampled very frequently, β̂^t(σ) may fluctuate by large amounts even when t is large. However, the nonmonotonicity of β̂^t is, in some sense, necessary to achieve the high-probability correctness guarantee. If β̂^t does not increase over time, then the probability that it is an incorrect bound remains constant, rather than decreasing polynomially with time as would be desired.

To study the convergence rate of Algorithm 7.2, then, we will instead analyze the quantity

ε̄_{i,T} = max_{σ_i} (1/T) Σ_{t=1}^{T} [ β̂^t_i(σ_i, σ^t_{−i}) − α̂^t_i(σ^t) ] + O(1/√T)
        = (1/T) [ R_{i,T} + U_{i,T} ] + O(1/√T),

where the O hides only an absolute constant. This quantity is identical to ε_{i,t} except that it uses β̂^t_i with σ^t_{−i} instead of β̂^T_i to match the regret term, and has an extra error term added.
Proposition 7.4. With probability 1 − O(1/√T), ε̄_{i,T} is an actual equilibrium gap bound.

By Theorem 5.5, U_T = Õ(N_T √T). Thus, ε̄_{i,T} matches, up to a logarithmic factor, the worst-case convergence rate Õ(N_T/√T) of any algorithm whose regret is Õ(N_T √T). For example, using CFR and variants thereof matches the bound of Corollary 6.6 with iterates that are linear time in the size of the pseudogame. With MCCFR, the iterates can be made even faster, and due to Farina, Kroer, and Sandholm (2020), even outcome-sampling MCCFR can be used without breaking the Õ(N_T/√T) bound.

Unfortunately, there is a further problem. It is often unwieldy to compute ε̄_{i,T}. For example, if using outcome-sampling MCCFR, one may not even have access to the true bounds β̂^t(·, σ^t_{−i}) (and similarly for α̂) but only stochastic estimates β̃^t(·, σ̃^t_{−i}) with the correct conditional expectation (Farina, Kroer, and Sandholm 2020). In that case, the stochastic estimate may be used as a substitute to create a stochastic equilibrium gap bound

ε̃_{i,T} = max_{σ_i} (1/T) Σ_{t=1}^{T} [ β̃^t_i(σ_i, σ^t_{−i}) − α̃^t_i(σ^t) ] + O( M √(log(T)/T) ),

where M is a bound on the norm of the estimates; i.e., |β̃^t_i(σ_i, σ̃^t_{−i}) − β̃^t_i(σ′_i, σ̃^t_{−i})| ≤ M for every pair of strategies σ_i, σ′_i. As discussed by Farina, Kroer, and Sandholm (2020), with a uniform sampling vector, we can achieve M ≤ N_T.
Proposition 7.5. With probability 1 − 1/T, for every time T and player i, we have ε̃_{i,T} ≥ ε̄_{i,T}.

Thus, in particular, we have:

Corollary 7.6. ε*_{i,T} := min(ε_{i,T}, ε̃_{i,T}) = Õ(N_T/√T) is an equilibrium gap bound with probability 1 − O(1/√T).

This is the desired result. In practice, ε̃_{i,T} is trivial until T = Ω(N_T), and ε_{i,T} is almost always a better bound. Thus, in our experiments, we use only ε_{i,T}. For this reason and for clarity, we have not bothered to specify the constants in the big-Os. Nevertheless, it is desirable theoretically to be able to define a quantity ε*_{i,T} that has both Õ(N_T/√T) convergence and (high-probability) correctness. As before, the correctness probability can be raised to any inverse-polynomial function of T by a suitable change to Equation (5.2).

As an equilibrium-finding algorithm, Algorithm 7.2 is a "weaker" version of just running the underlying regret minimizers on the full game: instead of each regret minimizer getting access to the true losses, they only get access to an upper bound. However, its main advantage over regret minimization is, as before, its ability to give an equilibrium gap bound that can be computed without full knowledge of the remainder of the game or exact nature action probabilities.

Finally, Algorithm 7.2 has an unintuitive property.

Warning 7.7. If the A_i are stochastic regret minimizers (e.g., MCCFR), instead of submitting −β̂^t_i(·, σ^t_{−i}), it may be tempting to submit a noisy (sampled) version of −β^t_i(·, σ^t_{−i}). Then the actual equilibrium gap β^{t*}_i(σ̄^t_{−i}) − α^t_i(σ̄^t) will converge, but the provable equilibrium gap ε̄_{i,t} may not. For a counterexample, see Appendix B.2.

If the nature probabilities are assumed to be known exactly, Warning 7.7 does not apply, since the actual bounds (α^t, β^t) and the sampled bounds (α̂^t, β̂^t) are the same. Even in this case, Algorithm 7.2 is still noteworthy: if we run it with outcome-sampling MCCFR, the result is an MCCFR-like algorithm (i.e., an equilibrium finder in the black-box case) that operates without an a-priori "uniform sampling strategy". Indeed, the iterations only require a uniform sampling strategy over the current pseudogame, not the full game! That algorithm is not quite a regret minimizer in the usual sense: its convergence rate depends on the uncertainty of the sampling method, and is tied to the fact that the sampling in Line 6 of Algorithm 7.2 uses the current strategy.

Experiments

We conducted experiments on two common benchmarks:
(1) k-rank Goofspiel. At each time t = 1, . . . , k, both players simultaneously place a bid for a prize. The prizes have values 1, . . . , k, and are randomly shuffled. The valid bids are also 1, . . . , k, each of which must be used exactly once during the game. The higher bid wins the prize; in case of a tie, the prize is split. The winner of each round is made public, but the bids are not. Our experiments use k = 4.
(2) k-rank heads-up limit Leduc poker (Southey et al. 2005), a small two-player variant of poker played with one hole card per player and one community card. Our experiments use a full range of poker ranks (k = 13).

We tested four algorithm variants. Except in the last case, which we will describe, all certificate-finding algorithms assume that the nature distributions are independent of player actions. In Goofspiel, we assume further that the nature distributions are independent of past nature actions, which is true (nature always plays uniformly at random).
(1) MCCFR with outcome sampling (OS-MCCFR) (Lanctot et al.
2009) (MCCFR). This algorithm requires the game tree to be fully expanded, and does not give a (nontrivial) certificate. However, it does give a benchmark for actual equilibrium gap convergence.
(2) Algorithm 7.2 with OS-MCCFR as the regret minimizer (Cert-MCCFR).
(3) Algorithm 6.1, with LP for the game solves (Cert-LP). Since the LP solves are relatively expensive, we only recompute the LP solution every several playthroughs sampled. This does not change the asymptotic performance of the algorithm. We use Gurobi v9.0.0 (Gurobi Optimization, LLC 2019) as the LP solver.
(4) Algorithm 6.1, except with no assumptions on relationships between nature distributions (Cert-LP-Indep).

Figure 1 shows the results. As expected, all the algorithms show a long-term convergence rate of roughly Θ̃(1/√t). All certificate-finding algorithms find nontrivial provable certificates with fewer samples than it would take to expand the whole game tree, showing the efficacy of our method.

Conclusions and future research

We developed algorithms that construct high-probability certificates in games with only black-box access. Our method can be used with either an exact game solver (e.g., LP solver) as a subroutine or a regret minimizer such as MCCFR. Table 1 shows which algorithm we recommend based on the use case. As a side effect, we developed an MCCFR-like equilibrium-finding algorithm that converges at rate Õ(√(log(t)/t)), and does not require a lower-bounded sampling vector. Our experiments show that our algorithms produce nontrivial certificates with very few samples.

This work opens many avenues for future research.
(1) Is there a "cleaner" way to fix the problem introduced in Section 7.3? For example, a different confidence sequence may fix the problem, or it could be the case that ε_{i,T} is small for most times t (or even only a constant fraction), which would show that min_{t ≤ T} ε_{i,T} = Õ(1/√T), matching Corollary 6.6.
(2) Is it possible to adapt Algorithm 7.2 to work with a generic extensive-form iterative game solver, for example, first-order methods such as EGT (Hoda et al. 2010; Kroer, Farina, and Sandholm 2018)?
(3) In many practical games, there are nature nodes h for which, under a particular profile σ, every child of h has similar utility: the range of utilities of the children of h under σ is far smaller than [α(h → ∗), β(h → ∗)]. Is it possible to incorporate this sort of information into the confidence-sequence pseudogames without losing perfect recall (which is needed for efficient solving)?
References

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning.
Berner, C.; Brockman, G.; Chan, B.; et al. 2019. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.
Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2015. Heads-up Limit Hold'em Poker is Solved. Science.
Brown, N.; and Sandholm, T. 2017. Superhuman AI for Heads-up No-limit Poker: Libratus Beats Top Professionals. Science, eaao1733.
Brown, N.; and Sandholm, T. 2019a. Solving Imperfect-Information Games via Discounted Regret Minimization. In AAAI Conference on Artificial Intelligence (AAAI).
Brown, N.; and Sandholm, T. 2019b. Superhuman AI for Multiplayer Poker. Science.
Farina, G.; Kroer, C.; and Sandholm, T. 2020. Stochastic Regret Minimization in Extensive-Form Games. arXiv preprint arXiv:2002.08493.
Gurobi Optimization, LLC. 2019. Gurobi Optimizer Reference Manual.
Hart, P.; Nilsson, N.; and Raphael, B. 1968. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics.
Hart, S.; and Mas-Colell, A. 2000. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica 68: 1127–1150.
Hoda, S.; Gilpin, A.; Peña, J.; and Sandholm, T. 2010. Smoothing Techniques for Computing Nash Equilibria of Sequential Games. Mathematics of Operations Research.
Koller, D.; Megiddo, N.; and von Stengel, B. 1994. Fast Algorithms for Finding Randomized Strategies in Game Trees. In Proceedings of the 26th ACM Symposium on Theory of Computing (STOC).
Kroer, C.; Farina, G.; and Sandholm, T. 2018. Solving Large Sequential Games with the Excessive Gap Technique. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. Monte Carlo Sampling for Regret Minimization in Extensive Games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; and Graepel, T. 2017. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 4190–4203.
Southey, F.; Bowling, M.; Larson, B.; Piccione, C.; Burch, N.; Billings, D.; and Rayner, C. 2005. Bayes' Bluff: Opponent Modelling in Poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI).
Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning. Nature.
Wellman, M. P. 2006. Methods for Empirical Game-Theoretic Analysis. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 1552–1555.
Zhang, B. H.; and Sandholm, T. 2020. Small Nash Equilibrium Certificates in Very Large Games. arXiv preprint arXiv:2006.16387.
Zinkevich, M. 2003. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In International Conference on Machine Learning (ICML), 928–936. Washington, DC, USA.
Zinkevich, M.; Bowling, M.; Johanson, M.; and Piccione, C. 2007. Regret Minimization in Games with Incomplete Information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).

A Proofs of Theorems
A.1 Theorem 5.3
Lemma A.1.
Fix a player i and chance node h. With probability at least 1 − 1/(t C n), for any assignment u : Children(h) → [α, β] of utilities, we have

| E_{a∼σ(·|h)} u(ha) − E_{a∼σ̂(·|h)} u(ha) | ≤ (β − α) ρ(h).

Proof. If ρ ≥ 1 the claim is trivial, so assume ρ < 1. The desired error term is a convex function of u, so we need only prove the theorem for u : Children(h) → {α, β}. By definition, σ̂(·|h) was created by sampling t(h) times. Thus, by Hoeffding's inequality,

Pr[ | E_{a∼σ(·|h)} u(ha) − E_{a∼σ̂(·|h)} u(ha) | ≥ (β − α) ρ ] ≤ 2 exp(−2 t(h) ρ(h)²) ≤ 2^{−|A(h)|} / (t C n),

where the last inequality uses the definition of ρ(h) in Equation (5.2). Taking a union bound over the 2^{|A_h|} choices of u completes the proof.

Thus, by a union bound, with probability 1 − 1/t, the above lemma is true for every player and chance node. Condition on this event, and take any player i and any profile σ. For notation, let σ̂ be the strategy profile in which chance plays according to σ̂ and the players play according to σ.

Lemma A.2.
At every node h, we have the bounds α̂_i(σ|h) ≤ α_i(σ|h) ≤ β_i(σ|h) ≤ β̂_i(σ|h).

Proof. By induction, leaves first. At the leaves, the lemma is trivial. Let h be any internal node. Then we have

α̂_i(σ|h) = E_{a∼σ̂(·|h)} α̂_i(σ|ha) − ρ(h) Δ_i(h → ∗)
         ≤ E_{a∼σ̂(·|h)} α_i(σ|ha) − ρ(h) Δ_i(h → ∗)
         ≤ E_{a∼σ(·|h)} α_i(σ|ha)
         = α_i(σ|h),

where the first two inequalities use, in order, the inductive hypothesis and the last lemma. An identical proof holds for β, and we are done.

The theorem now follows by applying the above lemma with h = ∅.
A.2 Theorem 5.5

Assume WLOG there is only one player, and drop the subscript i accordingly. Define the sampled cumulative uncertainty Û_T as

Û_T := Σ_{t=1}^{T} Δ̂^t(z^t),

where z^t is the last node in Ĝ^t reached during the play at time t. By linearity of expectation, we have E Û_T = E U_T. Define Û_K(h) to be the sampled uncertainty incurred at node h after node h is sampled K times. Formally,

Û_K(h) := Σ_{k=1}^{K} Δ̂^{t_{h,k}}(h → z^{t_{h,k}}),

where t_{h,k} is the k-th timestep on which h was sampled. Conveniently, Û_K(h) can be analyzed independently of the rest of the game. Our goal is to bound Û_T = Û_T(∅).

Let N_k(h) be the number of descendants of h, including h itself, at time t_{h,k}. Let C_k(h) be the same, except only counting chance nodes. Let ρ_k(h) be the value of ρ(h) after k samples at h. Once again, these quantities are independent of what happens outside the subgame rooted at h. We now prove a lemma, which has the theorem as the special case h = ∅.

Lemma A.3. For every exploration policy A, any node h of G, and any time K, we have E Û_K(h) ≤ C_K(h) √(K M) + N_K(h).

Proof. By induction on the nodes of the game tree, leaves first. For each child ha of h, let K_a be the number of times action a has been sampled.

Base case. If h is a leaf of G, then uncertainty at most 1 will be incurred when the leaf is expanded for the first time.

Inductive case.

E Û_K(h) ≤ 1 + Δ(h → ∗) ( Σ_{k=1}^{K} ρ_k(h) ) + Σ_{a ∈ A_h} E Û_{K_a}(ha)
         ≤ 1 + Σ_{k=1}^{K} √(M/k) + Σ_{a ∈ A_h} [ C_{K_a}(ha) √(K_a M) + N_{K_a}(ha) ]
         ≤ C_K(h) √(K M) + N_K(h),

where the three terms come from:
(1) an uncertainty of at most 1, incurred when h is first expanded,
(2) the uncertainty incurred at h itself, if it is a chance node, and
(3) the uncertainty incurred at each child node.

Once again, the theorem is the above lemma applied with h = ∅.

A.3 Proposition 6.3 and Proposition 7.3
These follow immediately from Theorem 5.3.
A.4 Proposition 6.5
Follows immediately from the definition of a pseudogame.
A.5 Proposition 7.4
Taking a union bound over times t ≥ √T in Theorem 5.3, we have that, with probability 1 − O(1/√T),

β̂^t_i(σ_i, σ^t_{−i}) − α̂^t_i(σ^t) ≥ β^t(σ_i, σ^t_{−i}) − α^t(σ^t)

for all t ≥ √T. The bound follows.

A.6 Proposition 7.5
Identical to Theorem 1 of Farina, Kroer, and Sandholm (2020).
B Counterexamples
B.1 Rate of convergence of the upper bound in Proposition 7.3
Consider the following multi-armed bandit instance with two arms, formulated as a one-player game: the left arm gives loss −K with probability 1/K, and 0 with probability 1 − 1/K. The right arm gives loss −1 deterministically.

With probability Θ(1/K), the first Θ(K)+1 samples of the left arm give rewards exactly (−K, −K, . . . , −K, 0). Condition on this event. After Θ(K) samples of the left arm, its upper bound will be −K + Θ(√(log T)). The (Θ(K)+1)-st sample will not happen until the upper bound exceeds at least −1, which only happens once T > exp(Θ(K)). Upon taking the (Θ(K)+1)-st sample, the upper bound on the left arm's utility will increase by Θ(1). But the reward range of this game is [−K, 0], so now taking any K = o(√T) completes the counterexample.

B.2 Warning 7.7
For example, consider the one-player multi-armed bandit case with two arms of differing utilities u(L) < u(R). Then the following two statements are simultaneously true:
(1) With MCCFR, with probability 1, there will exist some time T after which L will no longer be played ever again.
(2) β̂^t(L) will increase without bound if it is not played.
Thus, eventually, we will have β̂^t(L) > β̂^t(R).