Safe Search for Stackelberg Equilibria in Extensive-Form Games
Chun Kai Ling, Noam Brown Carnegie Mellon University Facebook AI [email protected], [email protected]
Abstract
Stackelberg equilibrium is a solution concept in two-player games where the leader has commitment rights over the follower. In recent years, it has become a cornerstone of many security applications, including airport patrolling and wildlife poaching prevention. Even though many of these settings are sequential in nature, existing techniques pre-compute the entire solution ahead of time. In this paper, we present a theoretically sound and empirically effective way to apply search, which leverages extra online computation to improve a solution, to the computation of Stackelberg equilibria in general-sum games. Instead of the leader attempting to solve the full game upfront, an approximate "blueprint" solution is first computed offline and is then improved online for the particular subgames encountered in actual play. We prove that our search technique is guaranteed to perform no worse than the pre-computed blueprint strategy, and empirically demonstrate that it enables approximately solving significantly larger games compared to purely offline methods. We also show that our search operation may be cast as a smaller Stackelberg problem, making our method complementary to existing algorithms based on strategy generation.
Introduction

Strong Stackelberg equilibria (SSE) have found many uses in security domains, such as wildlife poaching protection (Fang et al. 2017) and airport patrols (Pita et al. 2008). Many of these settings, particularly those involving patrolling, are sequential by nature and are best represented as extensive-form games (EFGs). Finding an SSE in general EFGs is provably intractable (Letchford and Conitzer 2010). Existing methods convert the problem into a normal-form game and apply column or constraint generation techniques to handle the exponential blowup in the size of the normal-form game (Jain, Kiekintveld, and Tambe 2011). More recent methods cast the problem as a mixed integer linear program (MILP) (Bosansky and Cermak 2015). Current state-of-the-art methods build upon this by heuristically generating strategies, and thus avoid considering all possible strategies (Černý, Bošanský, and Kiekintveld 2018).
All existing approaches for computing SSE are entirely offline. That is, they compute a solution for the entire game ahead of time and always play according to that offline solution. In contrast, search additionally leverages online computation to improve the strategy for the specific situations that come up during play. Search has been a key component for AI in single-agent settings (Lin 1965; Hart, Nilsson, and Raphael 1968), perfect-information games (Tesauro 1995; Campbell, Hoane Jr, and Hsu 2002; Silver et al. 2016, 2018), and zero-sum imperfect-information games (Moravčík et al. 2017; Brown and Sandholm 2017b, 2019). In order to apply search to two-player zero-sum imperfect-information games in a way that would not do worse than simply playing an offline strategy, safe search techniques were developed (Burch, Johanson, and Bowling 2014; Moravcik et al. 2016; Brown and Sandholm 2017a). Safe search begins with a blueprint strategy that is computed offline.
The search algorithm then adds extra constraints to ensure that its solution is no worse than the blueprint (that is, that it approximates an equilibrium at least as closely as the blueprint). However, safe search algorithms have so far only been developed for two-player zero-sum games.
In this paper, we extend safe search to SSE computation in general-sum games. We begin with a blueprint strategy for the leader, which is typically some solution (computed offline) of a simpler abstraction of the original game. The leader follows the blueprint strategy for the initial stages of the game, but upon reaching particular subgames of the game tree, computes a refinement of the blueprint strategy online, which is then adopted for the rest of the game. We show that with search, one can approximate SSEs in games much larger than purely offline methods can handle. We also show that our search operation is itself solving a smaller SSE, thus making our method complementary to other methods based on strategy generation. We evaluate our method on a two-stage matrix game, the classic game of Goofspiel, and a larger, general-sum variant of Leduc hold'em. We demonstrate that in large games our search algorithm outperforms offline methods while requiring significantly less computation, and that this improvement increases with the size of the game. Our implementation is publicly available online: https://github.com/lingchunkai/SafeSearchSSE.

Background and Related Work
As is standard in game theory, we assume that the strategies of all players, including the algorithms used to compute those strategies, are common knowledge. However, the outcomes of stochastic variables are not known ahead of time. EFGs model sequential interactions between players, and are typically represented as game trees in which each node specifies a state of the game where one player acts (except terminal nodes, where no player acts). In two-player EFGs, there are two players, P = {1, 2}. H is the set of all possible nodes h in the game tree, which are represented as sequences of actions (possibly including chance). A(h) is the set of actions available at node h, and P(h) ∈ P ∪ {c} is the player acting at that node, where c is the chance player. If a sequence of actions leads from h to h′, then we write h ⊏ h′. We denote by Z ⊆ H the set of all terminal nodes in the game tree. With each terminal node z we associate a payoff for each player, u_i : Z → ℝ. For each node h, the function C : H → [0, 1] gives the probability of reaching h, assuming both players play to do so.
Nodes belonging to player i ∈ P, i.e., {h ∈ H : P(h) = i}, are partitioned into information sets 𝓘_i. All nodes h belonging to the same information set I_i ∈ 𝓘_i are indistinguishable, and players must behave the same way for all nodes in I_i. Furthermore, all nodes in the same information set are required to have the same actions: if h, h′ ∈ I_i, then A(h) = A(h′). Thus, we overload A(I_i) to denote the set of actions in I_i. We assume that the game exhibits perfect recall, i.e., players do not 'forget' past observations or their own actions; for each player i, each information set I_i is preceded by a unique series of actions and information sets of i.

Sequence Form Representation.
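As a concrete companion to the sequence-form definitions that follow, the two constraints on realization plans (probability 1 on the empty sequence, plus flow conservation at each information set) can be checked mechanically. This is a minimal sketch with an illustrative two-level treeplex; the data layout and function name are ours, not the paper's.

```python
def is_realization_plan(r, infosets, eps=1e-9):
    """Check the sequence-form constraints for one player.

    r: dict mapping a sequence (tuple of actions) -> probability mass.
    infosets: list of (parent_sequence, actions) pairs, one per
              information set I_i, where parent_sequence = Seq_i(I_i).
    """
    if abs(r.get((), 0.0) - 1.0) > eps:          # r_i(empty) = 1
        return False
    for seq, actions in infosets:                # flow conservation
        total = sum(r.get(seq + (a,), 0.0) for a in actions)
        if abs(r.get(seq, 0.0) - total) > eps:
            return False
    return all(p >= -eps for p in r.values())    # non-negativity

# The player acts once at the root (actions a, b), and again at an
# information set reached after playing a (actions c, d).
r = {(): 1.0, ("a",): 0.6, ("b",): 0.4, ("a", "c"): 0.6, ("a", "d"): 0.0}
infosets = [((), ["a", "b"]), (("a",), ["c", "d"])]
assert is_realization_plan(r, infosets)
```

Mass splits 0.6/0.4 at the root, and the 0.6 entering the second information set is conserved across its two actions.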
Strategies in games with perfect recall may be compactly represented in sequence form (Von Stengel 1996). A sequence σ_i is an (ordered) list of actions taken by a single player i in order to reach a node h. The empty sequence ∅ is the sequence without any actions. The set of all possible sequences achievable by player i is given by Σ_i. We write σ_i a = σ′_i if a sequence σ′_i ∈ Σ_i may be obtained by appending an action a to σ_i. With perfect recall, all nodes h in an information set I_i ∈ 𝓘_i are reached by a unique sequence σ_i, which we denote by Seq_i(I_i) or Seq_i(h). Conversely, Inf_i(σ′_i) denotes the information set containing the last action taken in σ′_i. Using the sequence form, mixed strategies are given by realization plans, r_i : Σ_i → ℝ, which are distributions over sequences. The realization plan for a sequence σ_i gives the probability that this sequence of moves will be played, assuming all other players play so as to reach Inf_i(σ_i). Mixed strategies obey the sequence form constraints: for each i, r_i(∅) = 1, and for all I_i ∈ 𝓘_i, r_i(σ_i) = Σ_{a ∈ A(I_i)} r_i(σ_i a), where σ_i = Seq_i(I_i).
Sequence forms may be visualized using treeplexes (Hoda et al. 2010), one per player. Informally, a treeplex is a tree rooted at ∅ with subsequent nodes alternating between information sets and sequences; treeplexes are operationally useful for providing recursive implementations of common operations in EFGs, such as finding best responses. Since understanding treeplexes is helpful in understanding our method, we provide a brief introduction in the Appendix.

Stackelberg Equilibria in EFGs.
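Before the general EFG case, the one-shot multiple-LP approach (Conitzer and Sandholm 2006) referenced in this section can be sketched for a 2×2 game: for each follower action, the leader maximizes its expected payoff over mixed strategies for which that action is a best response. With two leader actions, each LP reduces to maximizing a linear function of p = P(leader action 0) over an interval, so the optimum is at an endpoint. The implementation below is our illustration, not code from the paper.

```python
def sse_2x2(U1, U2):
    """SSE of a 2x2 game via the multiple-LP idea.

    U1[i][j], U2[i][j]: leader/follower payoffs for actions (i, j).
    Returns (leader value, P(leader action 0), follower action).
    """
    best = None
    for j in (0, 1):
        k = 1 - j
        # Follower prefers j over k iff p*d0 + (1-p)*d1 >= 0, where:
        d0 = U2[0][j] - U2[0][k]
        d1 = U2[1][j] - U2[1][k]
        lo, hi = 0.0, 1.0                  # feasible p is an interval
        if abs(d0 - d1) < 1e-12:
            if d1 < 0:                     # infeasible for every p
                continue
        else:
            thresh = -d1 / (d0 - d1)
            if d0 - d1 > 0:
                lo = max(lo, thresh)
            else:
                hi = min(hi, thresh)
        if lo > hi:
            continue
        for p in (lo, hi):                 # leader utility is linear in p
            val = p * U1[0][j] + (1 - p) * U1[1][j]
            if best is None or val > best[0]:
                best = (val, p, j)
    return best

# Commitment helps: the leader mixes 50/50, making the follower
# indifferent; ties are broken in the leader's favor.
assert sse_2x2([[1, 3], [0, 2]], [[1, 0], [0, 1]]) == (2.5, 0.5, 1)
```

Closed intervals implement the leader-favored tie-breaking: a p where the follower is exactly indifferent is feasible for both candidate LPs.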
Strong Stackelberg equilibria (SSE) describe games in which there is asymmetry in the commitment powers of players. Here, players 1 and 2 play the role of leader and follower, respectively. The leader is able to commit to a (potentially mixed) strategy and the follower best-responds to this strategy, breaking ties by favoring the leader. By carefully committing to a mixed strategy, the leader implicitly issues threats, and followers are made to best-respond in a manner favorable to the leader. SSE are guaranteed to exist, and the value of the game for each player is unique. In one-shot games, a polynomial-time algorithm for finding an SSE is given by the multiple-LP approach (Conitzer and Sandholm 2006).
However, solving for SSE in general-sum EFGs with either chance or imperfect information is known to be NP-hard in the size of the game tree (Letchford and Conitzer 2010) due to the combinatorial number of pure strategies. Bosansky and Cermak (2015) avoid the transformation to normal form and formulate a compact mixed-integer linear program (MILP), which uses binary sequence-form follower best-response variables and scales to modestly sized problems. More recently, Černý, Bošanský, and Kiekintveld (2018) proposed heuristically guided incremental strategy generation.

Safe Search.
For this paper, we adopt the role of the leader and seek to maximize the leader's expected payoff under the SSE. We assume that the game tree may be broken into several disjoint subgames. For this paper, a subgame is defined as a set of states H_sub ⊆ H such that (a) if h ⊏ h′ and h ∈ H_sub then h′ ∈ H_sub, and (b) if h ∈ I_i and h ∈ H_sub, then for all h′ ∈ I_i, h′ ∈ H_sub. Condition (a) implies that one cannot leave a subgame after entering it, while (b) ensures that information sets are 'contained' within subgames—if any history in an information set belongs to a subgame, then every history in that information set belongs to that subgame. For the j-th subgame H^j_sub, 𝓘^j_i ⊆ 𝓘_i is the set of information sets belonging to player i within subgame j. Furthermore, let 𝓘^j_{i,head} ⊆ 𝓘^j_i be the 'head' information sets of player i in subgame j, i.e., I_i ∈ 𝓘^j_{i,head} if and only if Inf_i(Seq_i(I_i)) does not exist or does not belong to 𝓘^j_i. With a slight abuse of notation, let I^j_{i,head}(z) be the (unique, if existent) information set in 𝓘^j_{i,head} preceding leaf z.
At the beginning, we are given a blueprint strategy for the leader, typically the solution of a smaller abstracted game. The leader follows the blueprint strategy in actual play until reaching some subgame. Upon reaching the subgame, the leader computes a refined strategy and follows it thereafter. The pseudocode is given in Algorithm 1. The goal of the paper is to develop effective algorithms for the refinement step (*). Algorithm 1 implicitly defines a leader strategy distinct from the blueprint. Crucially, this implies that the follower responds to this implicit strategy and not the blueprint. Search is said to be safe when the leader applies Algorithm 1 such that its expected payoff is no less than the blueprint's, supposing the follower best responds to the algorithm.
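The two closure conditions defining a subgame can be checked mechanically. This is a minimal sketch with hypothetical state names; the representation (child lists plus an information partition) is ours.

```python
def is_subgame(H_sub, children, infosets):
    """Check the two subgame closure conditions on a state set H_sub.

    children: state -> list of child states (for the whole game tree).
    infosets: list of sets of states (the information partition).
    """
    # (a) closed under successors: checking immediate children
    # suffices, since closure then follows by induction on depth.
    for h in H_sub:
        for h2 in children.get(h, []):
            if h2 not in H_sub:
                return False
    # (b) each information set lies fully inside or fully outside.
    for I in infosets:
        inside = I & H_sub
        if inside and inside != I:
            return False
    return True

children = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
infosets = [{"a", "b"}, {"a1", "b1"}]
assert is_subgame({"a", "b", "a1", "a2", "b1"}, children, infosets)
assert not is_subgame({"a", "a1", "a2"}, children, infosets)  # splits {a, b}
```

The failing case illustrates condition (b): taking only the left branch would cut the information set {a, b} in half, so it is not a valid subgame.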
Algorithm 1: Generic search template.
Input: EFG specification, leader blueprint
while game is not over do
    if currently in some subgame j then
        if first time in this subgame then
            (*) Refine leader strategy for subgame j
        end
        Play action according to refined strategy
    else
        Play action according to blueprint
    end
end
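Algorithm 1 can also be rendered as a minimal executable sketch. All names here are ours; `refine` stands in for the (*) refinement step, and strategies are modeled as plain functions from states to actions.

```python
def play_game(game, blueprint, refine, subgame_of):
    """Generic search template (Algorithm 1) as a sketch.

    game:       iterable of decision points (observed game states)
    blueprint:  maps a state to the blueprint action
    refine:     computes a refined strategy for subgame j (step (*))
    subgame_of: maps a state to its subgame id, or None if pre-subgame
    """
    refined = {}                     # subgame id -> refined strategy
    actions = []
    for state in game:
        j = subgame_of(state)
        if j is not None:
            if j not in refined:     # first time in this subgame
                refined[j] = refine(j)
            actions.append(refined[j](state))
        else:
            actions.append(blueprint(state))
    return actions

# Toy run: two pre-subgame states, then subgame 0 is entered.
acts = play_game(
    game=["s1", "s2", "s3"],
    blueprint=lambda s: "bp:" + s,
    refine=lambda j: (lambda s: "refined:" + s),
    subgame_of=lambda s: 0 if s == "s3" else None,
)
assert acts == ["bp:s1", "bp:s2", "refined:s3"]
```

The cache `refined` captures the key property of the template: refinement happens at most once per subgame, upon first entry, and is followed thereafter.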
Figure 1: Unsafe naïve search and its game tree. Boxed regions denote subgames. Expected values for each player under (i) the blueprint strategy and its best response and (ii) naïve search are shown in the box, as are bounds guaranteeing no change of follower strategies after refinement.
To motivate our algorithm, we first explore how unsafe behavior may arise. Naïve search assumes that prior to entering a subgame, the follower plays the best response to the blueprint. For each subgame, the leader computes a normalized distribution over initial (subgame) states and solves a new game whose initial states obey this distribution.
Consider the two-player EFG in Figure 1, which begins with chance choosing each branch with equal probability. The follower then decides to e(X)it or (S)tay, where the latter brings the game into a subgame, denoted by the dotted box. Upon reaching A or B, the follower receives an expected value (EV) from best responding to the blueprint; these values are shown in Figure 1. Under the blueprint strategy, the follower chooses to stay (exit) on the left (right) branch.

Example 1.
Suppose the leader performs naïve search in Figure 1, which improves the leader's EV in A but reduces the follower's EV in A to below the follower's payoff for exiting. The follower is aware that the leader will perform this search and thus chooses X over S even before entering A. Conversely, suppose this search improves the leader's EV in B and also raises the follower's EV in B above the payoff for exiting. Then the higher post-search payoff in B causes the follower to switch from X to S. Together, these changes cause the leader's EV to drop below its value under the blueprint.
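The mechanism of Example 1 can be replayed with hypothetical numbers (ours, not the paper's): the follower compares its anticipated post-search EV inside the subgame against the payoff for exiting, so refinement that changes the follower's EV can flip the stay/exit decision upstream of the subgame.

```python
def follower_enters(subgame_ev_follower, exit_ev_follower):
    """The follower stays iff its anticipated EV inside the subgame is
    at least the EV of exiting (we assume ties favor staying here,
    i.e., the leader-favored tie-break)."""
    return subgame_ev_follower >= exit_ev_follower

# Hypothetical blueprint: follower EV 1 inside A, 0 for exiting.
assert follower_enters(1.0, 0.0)          # follower stays under blueprint
# Naive refinement raises the leader's EV inside A but drives the
# follower's EV down to -1: the follower now exits before entering A,
# so the leader's refined strategy is never even reached.
assert not follower_enters(-1.0, 0.0)
```

The same comparison, taken over all subgames simultaneously, is what makes refining every subgame (as Algorithm 1 implicitly does) more delicate than refining one in isolation.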
Figure 2: Failure when searching in multiple subgames.

Thus, sticking to the blueprint is preferable to naïve search, which means naïve search is unsafe.

Insight:
Naïve search may induce changes in the follower's strategy before the subgame, which adjusts the probability of entering each state within the subgame. If one could enforce that, in the refined subgame, payoffs to the follower in A remain no less than under the blueprint, then the follower would continue to stay, but possibly with leader payoffs greater than the blueprint's. Similarly, we may avoid entering B by enforcing that the follower's payoff in B not exceed its blueprint value.

Example 2.
Consider the game in Figure 2. Here, the follower chooses to exit or stay before the chance node is reached. If the follower chooses stay, then the chance node determines which of two identical subgames is entered. Under the blueprint, the follower receives a higher EV for choosing stay than for choosing exit.
Suppose search is performed only in the left subgame, and suppose it decreases the follower's EV in that subgame. The expected payoff for staying may nonetheless remain no less than that of exiting; the follower continues to favor staying (breaking ties in favor of the leader) and the leader's EV increases.
Now suppose search is performed on whichever subgame is encountered during play. Then the follower knows that his EV for staying will be lowered regardless of which subgame is reached, and thus will exit. Exiting decreases the leader's payoff below the blueprint value, and thus the search is unsafe.

Insight:
Performing search using Algorithm 1 is equivalent to performing search for all subgames. Even if conducting search only in a single subgame does not cause a shift in the follower's strategy, the combined effect of applying search to multiple subgames may. (Example 1's failure stems from the general-sum nature of the game and does not occur in zero-sum games; this multiple-subgame issue, however, occurs in zero-sum games as well (Brown and Sandholm 2017a).) Again, one could remedy this by carefully selecting constraints: if we lower-bound the follower's post-search EV in each of the two subgames appropriately, then we can guarantee that X is never chosen. Note that such a scheme is not unique; other splits of the allowable slack between the left and right subgames are safe too.
The crux of our method is to modify naïve search such that the follower's best response remains the same even when search is applied. This, in turn, is achieved by enforcing bounds on the follower's EV in any subgame strategy computed via search. Concretely, our search method comprises three steps: (i) preprocess the follower's best response to the blueprint and its values, (ii) identify a set of non-trivial safety bounds on follower payoffs, and (iii) solve for the SSE in the subgame reached, constrained to respect the bounds computed in (ii).

Preprocessing of Blueprint.
Denote the leader's sequence-form blueprint strategy as r₁^bp. We assume that the game is small enough that the follower's (pure, leader-favored in tiebreaks) best response to the blueprint may be computed; denote it by r₂^bp. We call the set of information sets which, based on r₂^bp, have non-zero probability of being reached the trunk, T = {I₂ ∈ 𝓘₂ : r₂^bp(Seq₂(I₂)) = 1}. Next, we traverse the follower's treeplex bottom up and compute the payoffs at each information set and sequence (accounting for chance factors C(z) at each leaf). We term these the best-response values (BRVs) under the blueprint. They are computed recursively for both σ₂ ∈ Σ₂ and I₂ ∈ 𝓘₂: (i) BRV(I₂) = max_{a ∈ A(I₂)} BRV(Seq₂(I₂)a), and (ii) BRV(σ₂) = Σ_{I′₂ ∈ 𝓘₂ : Seq₂(I′₂) = σ₂} BRV(I′₂) + Σ_{σ₁ ∈ Σ₁} r₁^bp(σ₁) g₂(σ₁, σ₂), where g_i(σ_i, σ_{−i}) is the expected utility of player i over all nodes reached when executing the sequence pair (σ_i, σ_{−i}): g_i(σ_i, σ_{−i}) = Σ_{h ∈ Z : σ_k = Seq_k(h)} u_i(h) · C(h). This preprocessing step involves just a single traversal of the game tree.

Generating Safety Bounds.
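Bounds generation consumes the BRVs computed in the preprocessing step. The bottom-up recursion (i)–(ii) can be sketched on a toy follower treeplex; the nested-dict layout is ours, with each sequence node carrying `g`, the precomputed expected payoff against the leader's blueprint (chance included).

```python
def brv_infoset(infoset):
    """BRV(I) = max over actions a of BRV(sequence reached by a)."""
    vals = {a: brv_sequence(seq) for a, seq in infoset["actions"].items()}
    best = max(vals, key=vals.get)
    return vals[best], best

def brv_sequence(seq):
    """BRV(sigma) = immediate expected payoff against the blueprint
    plus the BRVs of all child information sets under sigma."""
    v = seq.get("g", 0.0)       # sum over leaves of u2 * C * r1_bp
    for child in seq.get("infosets", []):
        v += brv_infoset(child)[0]
    return v

# Follower acts at the root (a or b); after a, acts again downstream.
tree = {"actions": {
    "a": {"g": 0.5, "infosets": [
        {"actions": {"c": {"g": 1.0}, "d": {"g": -2.0}}}]},
    "b": {"g": 1.0},
}}
assert brv_infoset(tree) == (1.5, "a")   # 0.5 + max(1.0, -2.0)
```

A single bottom-up pass over the treeplex suffices, mirroring the paper's claim that preprocessing needs only one traversal.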
Loosely speaking, we traverse the follower's treeplex top down while propagating bounds on follower payoffs which guarantee that the follower's best response remains r₂^bp. This is done recursively until we reach an information set belonging to some subgame j; the EV of that information set is then required to satisfy its associated bound in future steps of the algorithm. (Note that one could trivially achieve safety by simply sticking to the blueprint.) We illustrate the bounds generation process using the worked example in Figure 3. Values of information sets and sequences are annotated in order of traversal alongside their bounds, whose computation is as follows.
• The empty sequence ∅ requires a value greater than −∞.
• Each information set following ∅ (here, B) vacuously inherits the bound ≥ −∞.
• We want the sequence C to be chosen. Hence, the value of C must be at least that of the best alternative at B, which, combined with the inherited lower bound of −∞, gives the final lower bound for C.
• The values of the parallel information sets D and H must sum to at least the bound on C. Under the blueprint, their sum exceeds this bound; the resulting 'slack' is split evenly between D and H, yielding a lower bound for each.
• Sequence E requires a value no smaller than those of F and G and the bound for D, which contains it. We set the corresponding lower bound for E and matching upper bounds for F and G.
• Sequence I should be chosen over J; furthermore, the value of sequence I must satisfy the bound propagated into H. We choose the tighter of J's blueprint value and the propagated bound, yielding a lower bound for I and an upper bound for J.
• Sequences K and L should not be reached if the follower's best response to the blueprint is followed—we cannot make this portion too appealing. Hence, we apply upper bounds to sequences K and L.

Figure 3: Example of bounds computation. Filled boxes represent information sets, circled nodes are terminal payoff entries, and hollow boxes are sequences, which may be followed by parallel information sets, in turn preceded by dashed lines. The dashed rectangle indicates subgames, of which we show only the head information sets. BRVs of sequences and information sets are within the boxes, and the (labels, computed bounds) are placed next to them.

A formal description of bounds generation is deferred to the Appendix. The procedure is recursive and identical to the worked example. It takes as input the game, the blueprint, the best response r₂^bp, and the follower BRVs, and returns upper and lower bounds B(I₂) for all head information sets 𝓘^j_{2,head} of subgame j. Since the blueprint strategy and its best response satisfy these bounds, feasibility is guaranteed. By construction, lower and upper bounds are obtained for information sets within and outside the trunk, respectively. Note also that bounds computation requires only a single traversal of the follower's treeplex, which is smaller than the game tree.
The bounds generated are not unique. (i) Suppose we are splitting lower bounds at an information set I₂ between child sequences (e.g., the way bounds for sequences E, F, and G under information set D were computed). Let I₂ have a lower bound of B(I₂), and let the values of the best and second-best actions σ* and σ′ under the blueprint be v* and v′, respectively. Our implementation sets the lower and upper bounds for σ* and σ′ to max{(v* + v′)/2, B(I₂)}. However, any bound of the form max{α·v* + (1−α)·v′, B(I₂)}, α ∈ [0, 1], achieves safety. (ii) When splitting lower bounds at a sequence σ₂ between parallel information sets under σ₂ (e.g., when splitting the slack at C between D and H, or in Example 2), our implementation splits the slack evenly, though any non-negative split suffices. We explore these choices in our experiments.

MILP formulation for constrained SSE.
Once safety bounds are generated, we can include them in a MILP similar to that of Bosansky and Cermak (2015). The solution of this MILP is the strategy of the leader, normalized such that r₁(Seq₁(I₁)) = 1 for all I₁ ∈ 𝓘^j_{1,head}. Let Z^j be the set of terminal states which lie within subgame j, Z^j = Z ∩ H^j_sub. Let C^j(z) be the new chance probability when all actions taken prior to the subgame are converted to chance moves, according to the blueprint: C^j(z) = C(z) · r₁^bp(Seq₁(I^j_{1,head}(z))) · r₂^bp(Seq₂(I^j_{2,head}(z))). Similarly, we set g₂^j(σ₁, σ₂) = Σ_{h ∈ Z^j : σ_k = Seq_k(h)} u₂(h) · C^j(h). Let M(j) be the total probability mass entering subgame j in the original game under the blueprint strategy and its best response, M(j) = Σ_{z ∈ Z^j} C(z) · r₁^bp(Seq₁(z)) · r₂^bp(Seq₂(z)).

max_{p,r,v,s} Σ_{z ∈ Z^j} p(z) u₁(z) C^j(z)    (1)
s.t. v_{Inf₂(σ₂)} = s_{σ₂} + Σ_{I′₂ ∈ 𝓘₂ : Seq₂(I′₂) = σ₂} v_{I′₂} + Σ_{σ₁ ∈ Σ₁} r₁(σ₁) g₂^j(σ₁, σ₂)    ∀ σ₂ ∈ Σ₂^j    (2)
r_i(σ_i) = 1    ∀ i ∈ {1, 2}, I_i ∈ 𝓘^j_{i,head} : Seq_i(I_i) = σ_i    (3)
r_i(σ_i) = Σ_{a ∈ A_i(I_i)} r_i(σ_i a)    ∀ i ∈ {1, 2}, ∀ I_i ∈ 𝓘^j_i, σ_i = Seq_i(I_i)    (4)
0 ≤ s_{σ₂} ≤ (1 − r₂(σ₂)) · M    ∀ σ₂ ∈ Σ₂^j    (5)
0 ≤ p(z) ≤ r₁(Seq₁(z))    ∀ z ∈ Z^j    (6)
0 ≤ p(z) ≤ r₂(Seq₂(z))    ∀ z ∈ Z^j    (7)
Σ_{z ∈ Z^j} p(z) C^j(z) = M(j)    (8)
v_{I₂} ≥ B(I₂)    ∀ I₂ ∈ 𝓘^j_{2,head} ∩ T    (9)
v_{I₂} ≤ B(I₂)    ∀ I₂ ∈ 𝓘^j_{2,head} \ T    (10)
r₂(σ₂) ∈ {0, 1}    ∀ σ₂ ∈ Σ₂^j    (11)
0 ≤ r₁(σ₁) ≤ 1    ∀ σ₁ ∈ Σ₁^j    (12)

Conceptually, p(z) is such that the probability of reaching z is p(z)·C^j(z); r₁ and r₂ are the leader and follower sequence-form strategies; v_{I₂} is the value of information set I₂ when r₁ is adopted; s_{σ₂} is the slack for each follower sequence; and M is a sufficiently large (big-M) constant.
Objective (1) is the expected payoff in the full game that the leader gets from subgame j; (3) and (4) are sequence-form constraints; (5), (6), (7), (11), and (12) ensure the follower is best responding; and (8) ensures that the probability mass entering j is identical to the blueprint's.
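The mechanics of the big-M best-response constraints (5) and (11) can be illustrated in isolation: each follower sequence carries a slack s against its information-set value, and a sequence played with probability 1 is forced to have zero slack, i.e., it must attain that value. This checker is our illustration, not part of the paper's solver.

```python
def check_follower_br(seqs, M):
    """Verify constraints (5) and (11) for a list of (slack, r2) pairs:
    r2 is binary, and 0 <= s <= (1 - r2) * M, so the chosen sequence
    (r2 = 1) must be tight while unchosen ones may have any slack
    up to the big-M constant."""
    for s, r2 in seqs:
        if r2 not in (0, 1):                 # constraint (11)
            return False
        if not (0.0 <= s <= (1 - r2) * M):   # constraint (5)
            return False
    return True

# The chosen sequence is tight; the unchosen one slacks by 2.5.
assert check_follower_br([(0.0, 1), (2.5, 0)], M=10.0)
# Infeasible: a chosen sequence with positive slack is not a BR.
assert not check_follower_br([(0.5, 1), (0.0, 0)], M=10.0)
```

This is the standard MILP encoding of complementarity between the binary response variables and the value equations (2).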
Constraints (9) and (10) are the bounds previously generated and ensure that the follower does not deviate from r₂^bp after refinement. We discuss more details of the MILP in the Appendix.

Figure 4: The transformed tree for solving the constrained SSE with the safety bounds of Figure 1. A′ and B′ are auxiliary states introduced for the follower. B₋∞ is identical to B, except that leader payoffs are −∞.

Safe Search as SSE solutions.
One is not restricted to using a MILP to enforce these safety bounds. Here we show that the constrained SSE to be solved may be interpreted as the solution to another SSE problem. This implies that we can employ other SSE solvers, such as those involving strategy generation (Černý, Bošanský, and Kiekintveld 2018). We briefly describe how the transformation is performed on the j-th subgame, under the mild assumption that the follower head information sets 𝓘^j_{2,head} are the initial states in H^j_sub. More detail is provided in the Appendix. Figure 4 shows an example construction based on the game in Figure 1.
For every state h in some I₂ ∈ 𝓘^j_{2,head}, we compute the probability ω_h of reaching h under the blueprint, assuming the follower plays to reach it. The transformed game begins with chance leading to a normalized distribution of ω over these states. Now, recall that we need to enforce bounds on follower payoffs for head information sets I₂ ∈ 𝓘^j_{2,head}. To enforce a lower bound BRV(I₂) ≥ B(I₂), we use a technique similar to subgame resolving (Burch, Johanson, and Bowling 2014). Before each state h ∈ I₂, insert an auxiliary state h′ belonging to a new information set I′₂, where the follower may opt to terminate the game with a payoff of (−∞, B(I₂)/(ω_h·|I₂|)), or continue to h, whose subsequent states are unchanged. (The factor ω_h·|I₂| arises since B was computed on treeplexes, which already take chance into account.) If the leader's strategy has BRV(I₂) < B(I₂), the follower does better by terminating the game, leaving the leader with a payoff of −∞.
Enforcing an upper bound BRV(I₂) ≤ B(I₂) is done analogously. First, we reduce the payoffs to the leader for all leaves underneath I₂ to −∞. Second, the follower has an additional action at I₂ to terminate the game with a payoff of (0, B(I₂)/(ω_h·|I₂|)). If the follower's response to the leader's strategy gives BRV(I₂) > B(I₂), then the follower chooses some action other than terminating the game, which nets the leader −∞. If the bound is satisfied, then the leader gets a payoff of 0, which is expected given that an upper bound implies that I₂ is not part of the trunk.

Experiments

In this section we show experimental results for our search algorithm (based on the MILP above) on synthetic two-stage games, Goofspiel, and Leduc hold'em poker (modified to be general-sum). Experiments were conducted on an Intel i7-7700K @ 4.20GHz with 4 cores and 64GB of RAM. We use the commercial solver Gurobi (Gurobi Optimization 2019) to solve all instances of MILPs.
We show that even if Stackelberg equilibrium computation for the entire game (using the MILP of Bosansky and Cermak (2015)) is warm-started using the blueprint strategy r₁^bp and the follower's best response r₂^bp, in large games it is still intractable to compute a strategy. In fact, in some cases it is intractable to even generate the model, let alone solve it. In contrast, our safe search algorithm runs at a far lower computational cost and with far less memory. Since our games are larger than what Gurobi is able to solve to completion in reasonable time, we instead constrain the time allowed to solve each (sub)game and report the incumbent solution. We consider only the time taken by Gurobi in solving the MILP, which dominates preprocessing and bounds generation, both of which require only a constant number of passes over the game tree. In all cases, we warm-start Gurobi with the blueprint strategy.
To properly evaluate the benefits of search, we perform search on every subgame and combine the resulting subgame strategies to obtain the implicit full-game strategy prescribed by Algorithm 1. The follower's best response to this strategy is computed and used to evaluate the leader's payoff. Note that this is done only to measure how closely the algorithm approximates an SSE—in practice, search is applied only to the subgame reached in actual play and is performed just once. Hence, the worst-case time for a single playthrough is no worse than the longest time required for search over a single subgame (and not the sum over all subgames).
We compare our method against the MILP proposed by Bosansky and Cermak (2015) rather than the more recent incremental strategy generation method proposed by Černý, Bošanský, and Kiekintveld (2018). The former is flexible and applies to all EFGs with perfect recall, while the latter involves the Stackelberg Extensive-Form Correlated Equilibrium (SEFCE) as a subroutine for strategy generation. Computing an SEFCE is itself computationally difficult except in games with no chance, in which case finding an SEFCE can be written as a linear program.

Two-Stage Games.
The two-stage game closely resembles a two-step Markov game. In the first stage, both players play a general-sum matrix game G_main of size n × n, after which actions are made public. In the second stage, one out of M secondary games {G^j_sec}, each general-sum and of size m × m, is chosen and played. Each player obtains a payoff equal to the sum of their payoffs from the two stages. Given that the leader played action a, the probability of transitioning to game j is given by the mixture P(G^(j)_sec | a) = κ · X_{j,a} + (1 − κ) · q_j, where X is an M × n transition matrix with non-negative entries and columns summing to 1, and q lies on the M-dimensional probability simplex. Here, κ governs the level of influence the leader's strategy has on the next stage. (One may be tempted to first solve the M Stackelberg games independently and then apply backward induction, solving the first stage with payoffs adjusted for the second. This intuition is incorrect—the leader can issue non-credible threats in the second stage, inducing the follower to behave favorably in the first.)
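The transition model P(G^(j)_sec | a) = κ·X_{j,a} + (1 − κ)·q_j can be sketched directly; the function names and the toy matrices below are ours.

```python
import random

def transition_probs(X, q, a, kappa):
    """Mixture of an action-dependent transition column X[.][a] and a
    fixed distribution q over the M secondary games."""
    return [kappa * X[j][a] + (1 - kappa) * q[j] for j in range(len(q))]

def sample_secondary_game(X, q, a, kappa, rng=random):
    """Draw the index j of the secondary game actually entered."""
    probs = transition_probs(X, q, a, kappa)
    return rng.choices(range(len(q)), weights=probs, k=1)[0]

# Two secondary games, two leader actions; column a=0 of X is (1, 0).
X = [[1.0, 0.3], [0.0, 0.7]]
q = [0.5, 0.5]
assert transition_probs(X, q, a=0, kappa=0.5) == [0.75, 0.25]
# Columns of X and q both sum to 1, so the mixture is a distribution.
assert abs(sum(transition_probs(X, q, 1, 0.8)) - 1.0) < 1e-9
```

Setting κ = 0 removes the leader's influence on the second stage entirely, while κ = 1 makes the transition depend only on the leader's first-stage action.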
Table 1: Average leader payoffs for two-stage games; columns are n, M, m, κ, and the payoffs under the Blueprint, Ours, and Full-game methods.

Entries of X are chosen by independently drawing weights uniformly from [0, 1] and re-normalizing, while q is uniform. We generate 10 games for each setting of M, m, n, and κ. A subgame was defined for each action pair played in the first stage, together with the secondary game transitioned into. The blueprint was chosen to be the SSE of the first stage alone, with actions chosen uniformly at random for the second stage. The SSE for the first stage was solved using the multiple-LP method and runs in negligible time. For full-game solving, we allowed Gurobi a fixed time budget; for search, we allowed a smaller per-subgame budget, which in practice was never close to being exhausted.
We report the average quality of solutions in Table 1. The full-game solver reports the optimal solution if it converges. This occurs in the smaller game settings, where search also performs near-optimally. In the larger games, full-game solving fails to converge and barely outperforms the blueprint strategy. In fact, in the largest setting only 2 out of 10 cases resulted in any improvement over the blueprint, and even then the results were still worse than our method's. Our method yields substantial improvements over the blueprint regardless of κ.

Goofspiel.
Goofspiel (Ross 1971) is a game where play-ers simultaneously bid over a sequence of n prizes, valued at , · · · , n − . Each player owns cards worth , · · · , n , whichare used in closed bids for prizes auctioned over a span of n rounds. Bids are public after each round. Cards bid are dis-carded regardless of the auction outcome. The player withthe higher bid wins the prize. In a tie, neither player winsand the prize is discarded. Hence, Goofspiel is not zero-sum,players can benefit by coordinating to avoid ties.In our setting, the n prizes are ordered uniformly in anorder unknown to players. Subgames are selected to be allstates which have the same bids and prizes after first m rounds are resolved. As m grows, there are fewer but largersubgames. When m = n , the only subgame is the entiregame. The blueprint was chosen to be the NE under a zero first stage with payoffs adjusted for the second. This intuition isincorrect—the leader can issue non-credible threats in the secondstage, inducing the follower to behave favorably in the first. ( | Σ | , |I| ) m Num. Max. timeof sub- per sub- Leadergames game (s) utility (2 . , . · ‡ . · . · † (2 . , . · . · . · . · ‡ . · . · † † This is the earliest time thatthe incumbent solution achieves the given utility. ‡ This isequivalent to full-game search.(constant)-sum version of Goofspiel, where players split theprize evenly in ties. The NE of a zero-sum game may becomputed efficiently using the sequence form representation(Von Stengel 1996). Under the blueprint, the leader obtainsa utility of . and . for n = 4 and n = 5 respectively.Table 2 summarizes the solution quality and running timeas we vary n, m . When n = m , Gurobi struggles to solve theprogram to optimality and we report the best incumbent so-lution found within a shorter time frame. 
As a sanity check, observe that the leader's utility is never worse than the blueprint's. When n = 4, the incumbent solution for solving the full game improves significantly on the blueprint within seconds, indicating that the game is sufficiently small that performing search is unnecessary. However, when n = 5, solving the full game (m = 5) took many times longer than search (smaller m) to obtain any improvement over the blueprint, while search needed only seconds to improve upon the blueprint in a subgame. Furthermore, full-game solving required substantially more memory than search.
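The per-round bidding rules of our variant can be sketched directly; a minimal Python helper (the function name is ours, not from the paper):

```python
def resolve_round(bid1, bid2, prize):
    """Resolve one Goofspiel round: the higher bid takes the prize;
    on a tie the prize is discarded, so neither player scores."""
    if bid1 > bid2:
        return prize, 0
    if bid2 > bid1:
        return 0, prize
    return 0, 0  # tie: the prize is lost to both players

print(resolve_round(4, 2, 3))  # -> (3, 0)
print(resolve_round(2, 2, 3))  # -> (0, 0)
```

The tie case is what breaks the zero-sum structure: both players strictly prefer outcomes where ties are avoided, which is exactly the coordination opportunity the leader can exploit.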
Leduc Hold’em.
Leduc hold’em (Southey et al. 2012) is a simplified form of Texas hold’em. Players are dealt a single card at the beginning. In our variant there are n cards with 2 suits, 2 betting rounds, an initial per-player bet, and a maximum of 5 bets per round, with fixed bet sizes for the first and second rounds. In the second round, a public card is revealed. If a player’s card matches the number of the public card, then he/she wins in a showdown; otherwise the higher card wins (a tie is also possible).

Our variant of Leduc includes rake, which is a commission fee paid to the house. We assume for simplicity a fixed rake ρ, meaning that the winner receives a payoff of (1 − ρ)x instead of x, while the loser still receives −x. When ρ > 0, the game is not zero-sum. Player 1 assumes the role of the leader. Subgames are defined to be all states with the same public information from the second round onward. The blueprint strategy was obtained from the unraked (ρ = 0, zero-sum) variant and is solved efficiently using a linear program.

n   |Σ|     |I|     Blueprint   Ours      Full-game
3   5377    2016    -0.1738     -0.1686   -0.1335
4   ·       ·       ·           -0.1862   -0.1882
5   16001   6000    -0.2028     -0.2003   -0.2028
6   23425   8784    -0.1832     -0.1780   -0.1832
8   42497   15936   -0.1670     -0.1609   N/A

Table 3: Leader payoffs for Leduc hold’em with n cards.

α           0        0.25     0.5 ††   0.75     1
Goofspiel   5.32     5.34     5.29     5.31     5.30
Leduc       -0.184   -0.185   -0.186   -0.188   -0.189

β           1 ††     ·        ·        ·        ·
Goofspiel   5.29     5.35     5.55     5.56     5.50
Leduc       -0.186   -0.182   -0.178   -0.212   -0.212

Table 4: Leader payoffs for varying α and β. We consider Goofspiel with n = 5, m = 4 and Leduc hold’em with n = 4. Time constraints are the same as in previous experiments. †† These are the default values for α and β.

We limited the full-game method to a fixed overall time budget, and our method to a per-subgame budget. We reiterate that since we perform search only on the subgames encountered in actual play, the per-subgame budget bounds the search time incurred in a single playthrough (some SSE are easier than others to solve).

The results are summarized in Table 3. For large games, the full-game method struggles to improve on the blueprint. In fact, when n = 8 the number of terminal states is so large that the Gurobi model could not be created even after hours. Even for n = 6, model construction took an hour: the model had an enormous number of constraints and variables, a substantial fraction of them binary. Even when the model was successfully built, no progress beyond the blueprint was made.

Varying Bound Generation Parameters.
We now explore how varying α affects solution quality. Furthermore, we experiment with multiplying the slack (see information sets D and H in Section 4) by a constant β ≥ 1. This results in weaker but potentially unsafe bounds. Results on Goofspiel and Leduc are summarized in Table 4. We observe that lower values of α yield slightly better performance in Leduc, but we did not see any clear trend for Goofspiel. As β increases, we initially observe significant improvements. However, when β is too large, performance suffers and even becomes unsafe in the case of Leduc. These results suggest that search may be more effective with principled selections of α and β, which we leave for future work.

In this paper, we have extended safe search to the realm of SSE in EFGs. We show that safety may be achieved by adding a few straightforward bounds on the values of follower information sets. We showed that it is possible to cast the bounded search problem as another SSE, which makes our approach complementary to other offline methods. Our experimental results on Leduc hold’em demonstrate the ability of our method to scale to large games beyond those which MILPs may solve. Future work includes relaxing constraints on subgames and extending to other equilibrium concepts.
References
Bosansky, B.; and Cermak, J. 2015. Sequence-form algorithm for computing Stackelberg equilibria in extensive-form games. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Brown, N.; and Sandholm, T. 2017a. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems, 689–699.

Brown, N.; and Sandholm, T. 2017b. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, eaao1733.

Brown, N.; and Sandholm, T. 2019. Superhuman AI for multiplayer poker. Science.

Burch, N.; Johanson, M.; and Bowling, M. 2014. Solving imperfect information games using decomposition. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Campbell, M.; Hoane Jr, A. J.; and Hsu, F.-h. 2002. Deep Blue. Artificial Intelligence.

Černý, J.; Bošanský, B.; and Kiekintveld, C. 2018. Incremental strategy generation for Stackelberg equilibria in extensive-form games. In Proceedings of the 2018 ACM Conference on Economics and Computation, 151–168. ACM.

Conitzer, V.; and Sandholm, T. 2006. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, 82–90. ACM.

Fang, F.; Nguyen, T. H.; Pickles, R.; Lam, W. Y.; Clements, G. R.; An, B.; Singh, A.; Schwedock, B. C.; Tambe, M.; and Lemieux, A. 2017. PAWS: A deployed game-theoretic application to combat poaching. AI Magazine.

Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics.

Hoda, S.; Gilpin, A.; Peña, J.; and Sandholm, T. 2010. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research.

Jain, M.; Kiekintveld, C.; and Tambe, M. 2011. Quality-bounded solutions for finite Bayesian Stackelberg games: Scaling up. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 3, 997–1004. International Foundation for Autonomous Agents and Multiagent Systems.

Kroer, C.; Farina, G.; and Sandholm, T. 2018. Solving large sequential games with the excessive gap technique. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 864–874. Curran Associates, Inc. URL http://papers.nips.cc/paper/7366-solving-large-sequential-games-with-the-excessive-gap-technique.pdf.

Kuhn, H. W. 1950. A simplified two-person poker. Contributions to the Theory of Games 1: 97–103.

Letchford, J.; and Conitzer, V. 2010. Computing optimal strategies to commit to in extensive-form games. In Proceedings of the 11th ACM Conference on Electronic Commerce, 83–92. ACM.

Lin, S. 1965. Computer solutions of the traveling salesman problem. Bell System Technical Journal.

Moravčík, M.; Schmid, M.; Ha, K.; Hladík, M.; and Gaukrodger, S. J. 2016. Refining subgames in large imperfect information games. In Thirtieth AAAI Conference on Artificial Intelligence.

Moravčík, M.; Schmid, M.; Burch, N.; Lisý, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. 2017. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337): 508–513.

Pita, J.; Jain, M.; Marecki, J.; Ordóñez, F.; Portway, C.; Tambe, M.; Western, C.; Paruchuri, P.; and Kraus, S. 2008. Deployed ARMOR protection: the application of a game theoretic model for security at the Los Angeles International Airport. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems: Industrial Track, 125–132. International Foundation for Autonomous Agents and Multiagent Systems.

Ross, S. M. 1971. Goofspiel: the game of pure strategy. Journal of Applied Probability.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484–489.

Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419): 1140–1144.

Southey, F.; Bowling, M. P.; Larson, B.; Piccione, C.; Burch, N.; Billings, D.; and Rayner, C. 2012. Bayes’ bluff: Opponent modelling in poker. arXiv preprint arXiv:1207.1411.

Tesauro, G. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM.

Von Stengel, B. 1996. Efficient computation of behavior strategies. Games and Economic Behavior 14(2): 220–246.
A Treeplexes
Algorithms utilizing the sequence form may often be better understood when visualized as treeplexes (Hoda et al. 2010; Kroer, Farina, and Sandholm 2018), with one treeplex defined for each player. Informally, a treeplex may be visualized as a tree whose adjacent nodes alternate between information sets and sequences (actions), with the empty sequence forming the root of the treeplex. An example of a treeplex for Kuhn poker (Kuhn 1950) is given in Figure 5. In this example, a valid (pure) realization plan would be to raise when dealt a King or Jack, and when dealt a Queen, call, then fold if the opponent raises thereafter. Mixed strategies in sequence form are represented by the sequence-form constraints for each player i: r_i(∅) = 1 and r_i(σ_i) = Σ_{a ∈ A_i(I_i)} r_i(σ_i a) for each I_i ∈ I_i with σ_i = Seq_i(I_i). Graphically, these constraints may be visualized as ‘flow’ conservation constraints at information sets, with the flow duplicated across parallel information sets.

Operations such as best responses have easy interpretations when visualized on treeplexes. When one player’s strategy is fixed, the expected values of all leaves may be determined (the leaf utility multiplied by C(h) and the probability that the other player selects the required sequence). From there, the value of each information set and sequence may be computed via a bottom-up traversal of the treeplex: when parallel information sets are encountered, their values are summed, and when an information set is reached, we select the action with the highest value. After the treeplex is traversed, the actions chosen at each information set describe the behavioral strategy of the best response.
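The bottom-up traversal can be sketched in Python on a toy treeplex; leaf payoffs are assumed already weighted by chance and the opponent’s fixed strategy, and all names and numbers are hypothetical:

```python
# Each sequence node holds a (pre-weighted) leaf payoff plus any parallel
# infosets that follow it; infoset values take the max over actions.
def seq_value(seq, br):
    v = seq.get("payoff", 0.0)
    for infoset in seq.get("infosets", []):   # parallel infosets: sum values
        v += infoset_value(infoset, br)
    return v

def infoset_value(infoset, br):
    vals = {a: seq_value(s, br) for a, s in infoset["actions"].items()}
    best = max(vals, key=vals.get)            # take the highest-value action
    br[infoset["name"]] = best                # record the behavioral choice
    return vals[best]

# Toy treeplex loosely shaped like one card outcome in Kuhn poker.
root = {"payoff": 0.0, "infosets": [{
    "name": "Q",
    "actions": {
        "raise": {"payoff": 0.5},
        "call":  {"payoff": 0.0, "infosets": [{
            "name": "Q-then-[Raise]",
            "actions": {"call": {"payoff": -1.0}, "fold": {"payoff": -0.5}},
        }]},
    },
}]}

br = {}
print(seq_value(root, br), br)
# -> 0.5 {'Q-then-[Raise]': 'fold', 'Q': 'raise'}
```

The recorded argmax at each information set is exactly the behavioral best response described above.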
Figure 5: Treeplex of player 1 in Kuhn poker. Filled squares represent information sets, circled nodes are terminal payoffs, and hollow squares are points which lead to parallel information sets, which are preceded by dashed lines. Actions/sequences are given by full lines, and information from the second player is in square brackets. The treeplex is ‘rooted’ at the empty sequence. Subtreeplexes for the J and K outcomes are identical and thus omitted.

B Algorithm for Computing Bounds
In Section 4, we provided a worked example of how one could compute a set of non-trivial follower bounds which guarantee safety. Algorithm 2 provides an algorithmic description of how this could be done.
Function COMPUTEBOUNDS:
    Input: EFG specification, blueprint and its BRVs
    Output: bounds B(I) for all I ∈ I_{j,head}
    EXPSEQTRUNK(∅, −∞)

Function EXPSEQTRUNK(σ, lb):
    for each I ∈ {I | Seq(I) = σ}:
        slack ← (BRV(σ) − lb) / |{I | Seq(I) = σ}|
        EXPINFTRUNK(I, BRV(I) − slack)

Function EXPINFTRUNK(I, lb):
    if I ∈ I_{j,head}:
        B(I) ← lb; return
    σ*, σ′ ← best and second-best actions in I under the blueprint
    v*, v′ ← BRV(σ*), BRV(σ′)
    bound ← max((v* + v′)/2, lb)
    for each σ with Inf(σ) = I:
        if σ = σ* (i.e., σ is in the best response):
            EXPSEQTRUNK(σ, bound)
        else:
            EXPSEQNONTRUNK(σ, bound)

Function EXPSEQNONTRUNK(σ, ub):
    for each I ∈ {I | Seq(I) = σ}:
        slack ← (ub − BRV(σ)) / |{I | Seq(I) = σ}|
        EXPINFNONTRUNK(I, BRV(I) + slack)

Function EXPINFNONTRUNK(I, ub):
    if I ∈ I_{j,head}:
        B(I) ← ub; return
    for each σ with Inf(σ) = I:
        EXPSEQNONTRUNK(σ, ub)

Algorithm 2: Bounds generation procedure.

The COMPUTEBOUNDS function is the starting point of the bounds generation algorithm. It takes in an EFG specification, a given blueprint, and the BRVs (of sequences σ ∈ Σ and information sets I ∈ I) computed while preprocessing the blueprint. Our goal is to populate the function B(I), which maps information sets I ∈ I_{j,head} to upper/lower bounds on their values. We begin the recursive procedure by calling EXPSEQTRUNK on the empty sequence ∅ and the vacuous lower bound −∞. Note that ∅ is always in the trunk (by definition).

Specifically, EXPSEQTRUNK takes in a sequence σ ∈ Σ and a lower bound lb; we are guaranteed that lb ≤ BRV(σ). The function computes a set of lower bounds on the payoffs of the information sets I following σ such that (a) the best response to the blueprint satisfies these suggested bounds and (b) under the given bounds on the values of I, the follower can expect a payoff of at least lb. This is achieved by computing the slack, the excess of the blueprint with respect to lb, split equally between all I following σ. For each of these I, we require its value to be no smaller than the lower bound given by its BRV minus the slack. Naturally, this bound is weaker than the BRV itself.

The function EXPINFTRUNK performs the same bound generation process for a given information set I inside the trunk, given a lower bound lb. First, if I is part of the head of subgame j, then we simply store lb in B(I). If not, we look at all sequences immediately following I; specifically, we compare the best and second-best sequences, given by σ* and σ′. To ensure that the best sequence remains the best response, we need to decide on a threshold bound such that (i) no sequence other than σ* exceeds bound, while the value of σ* is no less than bound, and (ii) the blueprint itself obeys bound. One way to specify bound is to take the average of the BRVs of σ* and σ′.
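A runnable sketch of this procedure on a toy follower treeplex; all names and BRVs below are hypothetical, and the bound rule is the average above (i.e., α = 1/2):

```python
import math

# Hypothetical toy encoding of the follower's treeplex in one subgame.
# BRV: best-response values of sequences and infosets under the blueprint.
BRV = {"root": 10.0, "I0": 10.0, "a": 10.0, "b": 4.0, "Ia": 10.0, "Ib": 4.0}
CHILD_INFOSETS = {"root": ["I0"], "a": ["Ia"], "b": ["Ib"]}  # I with Seq(I) = sigma
ACTIONS = {"I0": ["a", "b"]}             # sequences with Inf(sigma) = I
HEAD = {"Ia", "Ib"}                      # head infosets of the subgame
BEST, SECOND = {"I0": "a"}, {"I0": "b"}  # blueprint best response actions

B = {}  # infoset -> ("lb" or "ub", bound value)

def exp_seq_trunk(sigma, lb):
    kids = CHILD_INFOSETS.get(sigma, [])
    for I in kids:
        slack = (BRV[sigma] - lb) / len(kids)  # split the excess equally
        exp_inf_trunk(I, BRV[I] - slack)

def exp_inf_trunk(I, lb):
    if I in HEAD:
        B[I] = ("lb", lb)
        return
    v_star, v_prime = BRV[BEST[I]], BRV[SECOND[I]]
    bound = max((v_star + v_prime) / 2.0, lb)  # average rule (alpha = 1/2)
    for sigma in ACTIONS[I]:
        if sigma == BEST[I]:
            exp_seq_trunk(sigma, bound)        # stays in the trunk
        else:
            exp_seq_non_trunk(sigma, bound)    # switch to upper bounds

def exp_seq_non_trunk(sigma, ub):
    kids = CHILD_INFOSETS.get(sigma, [])
    for I in kids:
        slack = (ub - BRV[sigma]) / len(kids)
        exp_inf_non_trunk(I, BRV[I] + slack)

def exp_inf_non_trunk(I, ub):
    if I in HEAD:
        B[I] = ("ub", ub)
        return
    for sigma in ACTIONS[I]:
        exp_seq_non_trunk(sigma, ub)

exp_seq_trunk("root", -math.inf)
print(B)  # -> {'Ia': ('lb', 7.0), 'Ib': ('ub', 7.0)}
```

Note how the best-response branch receives a lower bound while the deviating branch receives an upper bound, both anchored at the same threshold (here (10 + 4)/2 = 7).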
For σ* we recursively compute bounds by calling EXPSEQTRUNK. For all other sequences, we enter a new recursive procedure which generates upper bounds.

The function EXPSEQNONTRUNK is similar in implementation to its counterpart EXPSEQTRUNK, except that it computes upper instead of lower bounds. Likewise, EXPINFNONTRUNK stores an upper bound if I is in the head of subgame j; otherwise, it uses recursive calls to EXPSEQNONTRUNK to make sure that no sequence immediately following I has value greater than ub.

In Section 4, we remarked that bounds could be generated in alternative ways, for example by varying α; this alters the computation of bound in EXPINFTRUNK. In Section 5, we experiment with increasing the slack by some factor β ≥ 1, i.e., we alter the computation of slack in EXPSEQTRUNK by multiplying it by β. Note that this can potentially lead to unsafe behavior, since the follower’s payoff under this sequence may then be strictly less than lb.

C Details of MILP formulation for SSE
First, we review the MILP of Bosansky and Cermak (2015):

max_{p, r_1, r_2, v, s}  Σ_{z ∈ Z} p(z) u_1(z) C(z)    (13)

subject to

v_{Inf(σ_2)} = s_{σ_2} + Σ_{I′ ∈ I_2 : Seq_2(I′) = σ_2} v_{I′} + Σ_{σ_1 ∈ Σ_1} r_1(σ_1) g_2(σ_1, σ_2)    ∀σ_2 ∈ Σ_2    (14)
r_i(∅) = 1    ∀i ∈ {1, 2}    (15)
r_i(σ_i) = Σ_{a ∈ A_i(I_i)} r_i(σ_i a)    ∀i ∈ {1, 2}, ∀I_i ∈ I_i, σ_i = Seq_i(I_i)    (16)
0 ≤ s_{σ_2} ≤ (1 − r_2(σ_2)) · M    ∀σ_2 ∈ Σ_2    (17)
0 ≤ p(z) ≤ r_1(Seq_1(z))    ∀z ∈ Z    (18)
0 ≤ p(z) ≤ r_2(Seq_2(z))    ∀z ∈ Z    (19)
Σ_{z ∈ Z} p(z) C(z) = 1    (20)
r_2(σ_2) ∈ {0, 1}    ∀σ_2 ∈ Σ_2    (21)
0 ≤ r_1(σ_1) ≤ 1    ∀σ_1 ∈ Σ_1    (22)

Conceptually, p(z) is the product of player probabilities of reaching leaf z, so that the probability of reaching z is p(z)C(z). The variables r_1 and r_2 are the leader and follower strategies in sequence form respectively, while v is the EV of each follower information set when r_1 and r_2 are adopted. s is the (non-negative) slack for each sequence/action in each information set, i.e., the difference between the value of an information set and the value of a particular sequence/action within that information set. The term g_i(σ_i, σ_{−i}) is the EV of player i over all nodes reached when executing the pair of sequences (σ_i, σ_{−i}): g_i(σ_i, σ_{−i}) = Σ_{h ∈ Z : σ_k = Seq_k(h)} u_i(h) · C(h).

Constraint (14) ties the values of the information sets v to the slack variables s and the payoffs. That is, for every sequence σ_2 of the follower, the value of its preceding information set is equal to the EV of all information sets I′ immediately following σ_2 (second term) added to the payoffs from all leaf sequences terminating with σ_2 (third term), compensated by the slack of σ_2. Constraints (15) and (16) are the sequence-form constraints (Von Stengel 1996). Constraint (17) ensures that, for large enough values of M, if the follower’s sequence-form strategy is 1 for some sequence, then the slack for that sequence cannot be positive, i.e., the follower must be choosing the best action for himself.
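In the one-shot (single-infoset) case, the commit-then-best-respond structure this MILP enforces coincides with the multiple-LP method of Conitzer and Sandholm (2006): one LP per follower pure response. A sketch with scipy on hypothetical 2×2 payoffs:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2x2 one-shot game: the row player (leader) commits first.
A = np.array([[2.0, 4.0], [1.0, 3.0]])  # leader utilities
B = np.array([[1.0, 0.0], [0.0, 1.0]])  # follower utilities

best_val, best_x = -np.inf, None
for j in range(B.shape[1]):              # one LP per follower pure response j
    # Incentive rows: follower must weakly prefer column j over every k != j.
    A_ub = np.stack([B[:, k] - B[:, j] for k in range(B.shape[1]) if k != j])
    res = linprog(-A[:, j],              # linprog minimizes, so negate
                  A_ub=A_ub, b_ub=np.zeros(A_ub.shape[0]),
                  A_eq=np.ones((1, A.shape[0])), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * A.shape[0], method="highs")
    if res.status == 0 and -res.fun > best_val:
        best_val, best_x = -res.fun, res.x

print(best_val, best_x)  # leader gets 3.5 by mixing (0.5, 0.5)
```

The weak inequality in the incentive rows breaks follower ties in the leader’s favor, matching the SSE convention; the big-M constraint (17) plays the analogous role in the sequence-form MILP.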
Constraints (18), (19), and (20) ensure that p(z)C(z) is indeed the probability of reaching each leaf. Constraints (21) and (22) enforce that the follower’s best response is pure, and that sequence-form strategies must lie in [0, 1] for all sequences. The objective (13) is the expected utility of the leader, which is linear in p(z).

The MILP we propose for solving the constrained subgame is similar in spirit. Constraints (2)-(8), (11) and (12) are analogous to constraints (14)-(20), (21), and (22), except that they apply to subgame j instead of the full game. Similarly, the objective (1) is to maximize the payoffs from within subgame j. The key addition is constraints (9) and (10), which are precisely the bounds computed earlier when traversing the treeplex.

D Transformation of Safe Search into SSE Solutions
We provide more details on how the constrained SSE can be cast as another SSE problem. The general idea is loosely related to the subgame-resolving method of Burch, Johanson, and Bowling (2014), although our method extends to general-sum games and allows for the inclusion of both upper and lower bounds, as is needed for our search operation.

The broad idea behind Burch, Johanson, and Bowling (2014) is to (i) create an initial chance node leading to all leading states in the subgame (i.e., all states h ∈ H^j_sub such that there is no state h′ ∈ H^j_sub with h′ ⊏ h), based on the normalized probability of encountering those states under r^bp_i, and (ii) enforce the constraints using a gadget; specifically, by adding a small number of auxiliary information sets/actions to help coax the solution into obeying the required bounds.

Restricted case: initial states h in head information sets. We make the assumption that I_{j,head} is a subset of the initial states in subgame j.

Preliminaries.
For some sequence-form strategy pair r_1, r_2 for leader and follower respectively, the expected payoff to player i is given by Σ_{z ∈ Z : σ_i = Seq_i(z)} r_1(σ_1) · r_2(σ_2) · u_i(z) · C(z), i.e., the summation of the utilities u_i(z) of each leaf of the game, multiplied by the probability that both players play the required sequences and the chance factor C(z). That is, the utility from each leaf z is weighed by the probability r_1(σ_1) · r_2(σ_2) · C(z) of reaching it. The value of an information set I_i ∈ I_i is the contribution from all leaves under I_i, i.e., V_i(I_i) = Σ_{h ∈ Z, h′ ∈ I_i, h′ ⊏ h : σ_k = Seq_k(h)} r_1(σ_1) · r_2(σ_2) · u_i(h) · C(h), taking into account the effect of chance for each leaf.

Now let b_i(σ_i) be the behavioral strategy associated with σ_i, i.e., b_i(σ_i) = r_i(σ_i)/r_i(Seq_i(Inf_i(σ_i))) if the denominator is positive, and b_i(σ_i) = 0 otherwise. The sequence-form probability r_i(σ_i) is the product of behavioral strategies at previous information sets. Hence, each of these terms in V_i (be it from leader, follower, or chance) can be separated into products involving probabilities accrued before or after subgame j. That is, for a leaf z ∈ Z^j, the probability of reaching it can be written as p(z) = ˆr^j_1(z) · ˆr^j_2(z) · ˆC^j(z) · ˇr^j_1(z) · ˇr^j_2(z) · ˇC^j(z), where ˆ(·)^j and ˇ(·)^j represent probabilities accrued before and after subgame j respectively.

The original game.
Figure 6 illustrates our setting. For a state h in a head information set I ∈ I_{j,head}, define ω^bp_h to be the probability of reaching h following the leader’s blueprint, assuming the follower plays to reach h. Denote by h^j_z the first state in subgame j leading to leaf z; then ω^bp_{h^j_z} = ˆr^bp_1(z) · ˆC^j(z) is the product of the contributions from the leader and chance, but not the follower. The probability of reaching leaf z is then given by p(z) = ω^bp_{h^j_z} · ˆr^j_2(z) · ˇr^j_1(z) · ˇr^j_2(z) · ˇC^j(z).

Observe that if z_t lies beneath an infoset I_t ∈ I_{j,head} ∩ T (i.e., it lies in the trunk and the follower under the blueprint plays to reach I_t), then ˆr^j_2(z_t) = 1 (since r^bp_2(Seq_2(I_t)) = 1). Conversely, if z_t̄ lies under I_t̄ ∈ I_{j,head} ∩ T̄, i.e., not part of the trunk, then ˆr^j_2(z_t̄) = 0, and the probability of reaching the leaf (under ˆr^bp_2) is 0. From now on, we drop the superscript (·)^bp from ω when it is clear that we are basing it on the blueprint strategy; this is consistent with the notation used in Section 4.

We want to find a strategy ˇr^j_1 such that for every information set I ∈ I_{j,head}, when ˆr_i = ˆr^bp_i, the best response ˇr^j_2 ensures that V_2(I) obeys some upper or lower bounds.

Figure 6: Decomposition of probabilities for subgame j. Curly lines indicate a series of actions from either player or chance. The dashed box shows subgame j, while the dotted boxes are head information sets in I_{j,head}. Thick lines belong to states that are below information sets belonging to the trunk. States that do not lead to subgame j are omitted. Note that the subtrees under the information sets are not disjoint, as information sets of the leader can span both subtrees.

That is, the value of the information set in this game, given by

V_2(I) = Σ_{z ∈ Z, h ∈ I, h ⊏ z} ω_{h^j_z} · ˇr^j_1(z) · ˇr^j_2(z) · ˇC^j(z) · u_2(z),    (23)

should be no greater/less than some B(I).

The transformed game.
Now consider the transformed subgame described in Section 4. Figure 7 illustrates what this transformation may look like, together with the corresponding probabilities. We look at all possible initial states in subgame j, and start the game with a chance node leading to the head states h with distribution proportional to ω_h. For subgame j, let the normalizing constant over initial states be η_j > 0. Note that since we are including states outside of the trunk, η_j may be greater or less than 1. We duplicate every initial state and head information set, giving the follower the option of terminating or continuing the game, where terminating yields an immediate payoff of (−∞, B(I_t)/(ω_h |I_t|)) when the information set I_t containing h belongs to the trunk, and (0, B(I_t̄)/(ω_h |I_t̄|)) otherwise. For leaves which are descendants of non-trunk head information sets, i.e., z ∈ Z with h ∈ I ∈ I_{j,head} ∩ T̄ and h ⊏ z, the leader’s payoffs are adjusted to −∞. There is a one-to-one correspondence between behavioral strategies in the modified subgame and in the original game, obtained simply by using ˇr^j_i interchangeably.

Figure 7: An example of the transformed game of Figure 6. Information sets I′_t and I′_t̄ are newly added information sets. Auxiliary actions are in blue, all belonging to newly added information sets.
Leaves which are descendants of head information sets not belonging to the trunk are marked by red crosses; their payoff to the leader is set to −∞, while the follower payoffs are kept the same.

Next, we show that (i) the bounds B are satisfied by the solution to the transformed game, (ii) for head information sets in the trunk, any solution satisfying B will never achieve a higher payoff by selecting an auxiliary action, and (iii) for head information sets outside of the trunk, any solution satisfying B will, by selecting the auxiliary action, achieve a payoff greater than or equal to that of continuing with the game.

For (i), we first consider an information set I_t which is a head infoset within the trunk. The terminate action results in a follower payoff (taking into account the added initial chance node) independent of the leader’s subgame strategy ˇr^j_1:

Σ_{h ∈ I_t} [B(I_t)/(ω_h |I_t|)] · η_j ω_h = η_j B(I_t).    (24)

If the follower chooses to continue the game, then his payoff (now dependent on the leader’s refined strategy ˇr^j_1) is obtained by a weighted sum over leaves:

η_j Σ_{z ∈ Z, h ∈ I_t, h ⊏ z} ω_h · ˇr^j_1(z) · ˇr^j_2(z) · ˇC^j(z) · u_2(z).    (25)

If the leader is to avoid obtaining −∞, then the follower must choose to remain in the game, which happens only when (25) ≥ (24), i.e.,

Σ_{z ∈ Z, h ∈ I_t, h ⊏ z} ω_h · ˇr^j_1(z) · ˇr^j_2(z) · ˇC^j(z) · u_2(z) ≥ B(I_t).    (26)

The expression on the left-hand side of this inequality is precisely the expression in (23). Since the leader can always avoid the −∞ payoff by selecting ˇr_1 in accordance with the blueprint, the auxiliary action is never chosen, and hence the lower bounds on the values of trunk information sets are always satisfied.

Similar expressions can be found for non-trunk head infosets, and (24) holds completely analogously.
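The accounting in (24) and (25) can be checked numerically; a small sketch with hypothetical values of ω, B(I_t), and per-state continuation values:

```python
# Toy head infoset I_t with two states; all numbers are hypothetical.
omega = {"h1": 0.375, "h2": 0.125}   # blueprint reach probabilities omega_h
B_It = 1.0                           # bound B(I_t) on the infoset value
eta = 1.0 / sum(omega.values())      # normalizing constant eta_j (= 2.0)
n = len(omega)                       # |I_t|

# Terminate: chance reaches h w.p. eta*omega[h]; the auxiliary action then
# pays the follower B(I_t) / (omega[h] * |I_t|), so the total is eta*B(I_t).
terminate = sum(eta * w * B_It / (w * n) for w in omega.values())
assert abs(terminate - eta * B_It) < 1e-9   # matches (24) for any omega

# Continue: eta times the infoset value V(I_t), as in (25).
v = {"h1": 1.2, "h2": 0.6}           # continuation values at each state
V_It = sum(omega[h] * v[h] for h in omega)
cont = eta * V_It

# The follower keeps playing iff cont >= terminate, i.e. V(I_t) >= B(I_t).
print(cont >= terminate, V_It >= B_It)  # -> False False (bound violated)
```

Because η_j multiplies both sides, the follower’s terminate-or-continue decision reduces exactly to the comparison V(I_t) ≥ B(I_t), which is how the gadget forces the leader to respect the bound.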
The solution to the transformed game must ensure that the follower always selects the auxiliary action at information sets not belonging to the trunk, so as to avoid the −∞ payoffs from continuing. Therefore, the solution to the transformed game guarantees that (26) holds, except that the direction of the inequality is reversed. Again, the left-hand side of the expression corresponds to (23). Hence, the SSE of the transformed game satisfies our required bounds. Furthermore, by starting from (23) and working backward, we can also show that any solution ˇr^j_1 satisfying the constrained SSE does not lead to a best response yielding −∞ for the leader.

Finally, we show that the objective function of the game is identical up to a positive constant. In the original constrained SSE problem, we sum over all leaf descendants of the trunk and compute the leader’s utilities weighed by the probability of reaching those leaves:

Σ_{I ∈ I_{j,head} ∩ T} Σ_{z ∈ Z, h ∈ I, h ⊏ z} ω_{h^j_z} · ˇr^j_1(z) · ˇr^j_2(z) · ˇC^j(z) · u_1(z).

The transformed game has the same expression, except for an additional factor of η_j. Unlike the constrained SSE setting, the initial distribution has a non-zero probability of starting in a non-trunk state h ∈ I ∈ I_{j,head} ∩ T̄. However, since the auxiliary action is always taken under optimality, the leader payoffs from those branches will be 0.

The general case.
The general case is slightly more complicated: the initial states in subgames may not belong to the follower. The issue with trying to add auxiliary states in the same way as before is that there could be leader actions lying between the start of the subgame and the (follower) head information sets. These leader actions have probabilities which are not yet fixed at the start of the search process. To overcome this, instead of enforcing bounds on information sets, we enforce bounds on parts of their parent sequences (which lie outside the subgame).

We first partition the head information sets into groups based on their parent sequence; groups can be singletons. Observe that the information sets in the same group are either all in the trunk or all outside it. Let the groups be G_k = {I_{k,1}, I_{k,2}, ..., I_{k,m_k}}, I_{k,q} ∈ I_{j,head}, and let the group’s heads be the initial states which contain a path to some state in the group, i.e., G_{k,head} = {h | h ⊏ h′, h′ ∈ I_{k,q} ∈ I_{j,head}, and there is no h″ ∈ H^j_sub with h″ ⊏ h}. Crucially, note that for two distinct groups G_i, G_k, i ≠ k, their heads G_{i,head} and G_{k,head} are disjoint. This is because (i) there must be some difference in the prior actions player 2 took (prior to reaching the head information sets) that caused them to be in different groups, and (ii) this action must be taken prior to the subgame, by the definition of a head information set.

If two information sets I_{2,1}, I_{2,2} ∈ I_{j,head} have the same parent sequence σ_2 = Seq_2(I_{2,1}) = Seq_2(I_{2,2}), i.e., they belong to the same group G_k, it follows that their individual bounds B(I_{2,1}), B(I_{2,2}) must have come from a split of some bound (upper or lower) on the value of σ_2.
Instead of trying to enforce that the bounds for I_{2,1} and I_{2,2} are satisfied individually, we enforce a bound on the sum of the values of I_{2,1} and I_{2,2}, since the sum is what is truly important in the bound for σ_2 when we perform the bounds-generation procedure.

The transformation then proceeds in the same way as in the restricted case, except that we operate on the heads of each group rather than on the head information sets. The bound for the heads of each group is the sum of the bounds of the head information sets in that group, and the factor containing the size of the head information set is replaced by the number of heads of that group.

Upper and lower bounds are enforced using the same gadget as in the restricted case, depending on whether the bound is an upper or lower bound. Figure 8 shows an example of a lower bound in group G_k. Note that the follower payoff for the auxiliary actions contains a sum of bounds over all information sets belonging to the group. Technically, we are performing safe search while respecting weaker (but still safe) bounds. Upper bounds are handled analogously to the restricted case.

Figure 8: An example of a general transformation. Brown dashed lines are the heads of individual groups. I_{k,1} and I_{k,2} belong to the same group G_k with heads G_{k,head}. The newly created auxiliary states are in the new information set I′_{2,k}. In this case, G_k is in the trunk, hence a lower bound is enforced for G_k.
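The grouping step, with group bounds obtained by summing the member bounds, can be sketched as follows (all names and bound values hypothetical):

```python
from collections import defaultdict

# Hypothetical head infosets of subgame j: name -> (parent sequence, bound).
head = {"I1": ("sigma_a", 0.5), "I2": ("sigma_a", 0.25), "I3": ("sigma_b", 1.1)}

groups = defaultdict(list)        # parent sequence -> infosets in the group
group_bound = defaultdict(float)  # bound enforced on the group's summed value
for name, (parent, b) in head.items():
    groups[parent].append(name)
    group_bound[parent] += b

print(dict(groups))       # -> {'sigma_a': ['I1', 'I2'], 'sigma_b': ['I3']}
print(dict(group_bound))  # -> {'sigma_a': 0.75, 'sigma_b': 1.1}
```

Since the individual bounds within a group originally arose from splitting a single bound on the shared parent sequence, enforcing only their sum recovers exactly the quantity the bounds-generation procedure cared about.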