Efficient Deviation Types and Learning for Hindsight Rationality in Extensive-Form Games
Dustin Morrill, Ryan D'Orazio, Marc Lanctot, James R. Wright, Michael Bowling, Amy Greenwald
Affiliations: University of Alberta; Alberta Machine Intelligence Institute, Canada · Université de Montréal; Mila, Canada · DeepMind · Brown University, United States
Abstract
Hindsight rationality is an approach to playing multi-agent, general-sum games that prescribes no-regret learning dynamics and describes jointly rational behavior with mediated equilibria. We explore the space of deviation types in extensive-form games (EFGs) and discover powerful types that are efficient to compute in games with moderate lengths. Specifically, we identify four new types of deviations that subsume previously studied types within a broader class we call partial sequence deviations. Integrating the idea of time selection regret minimization into counterfactual regret minimization (CFR), we introduce the extensive-form regret minimization (EFR) algorithm that is hindsight rational for a general and natural class of deviations in EFGs. We provide instantiations and regret bounds for EFR that correspond to each partial sequence deviation type. In addition, we present a thorough empirical analysis of EFR's performance with different deviation types in common benchmark games. As theory suggests, instantiating EFR with stronger deviations leads to behavior that tends to outperform that of weaker deviations.
Introduction

We seek more effective algorithms for playing multi-player, general-sum extensive-form games (EFGs). The hindsight rationality framework (Morrill et al. 2020) provides the theoretical foundation for a game-playing approach that prescribes no-regret dynamics and describes jointly rational behavior with mediated equilibria (Aumann 1974). Rationality within this framework is measured by regret in hindsight relative to strategy transformations, also called deviations, rather than as prospective optimality with respect to beliefs. Each deviation provides a baseline that a learner must surpass, so a stronger set of baselines pushes a hindsight rational learner to perform better. While larger deviation sets containing more sophisticated deviations produce stronger baselines, they also raise computational and storage requirements.
E.g., the internal deviations (Foster and Vohra 1999) conditionally transform one particular strategy to another and are fundamentally stronger than the external deviations that change every strategy into another unconditionally, but the number of internal deviations grows quadratically with the number of strategies in a game, while the number of external deviations is linear. However, the number of strategies in an EFG grows exponentially with the size of the game, making it appear intractable to be hindsight rational for even external deviations.

The counterfactual regret minimization (CFR) (Zinkevich et al. 2007b) algorithm makes use of the EFG structure to be efficiently hindsight rational for external deviations. Modifications to CFR by Celli et al. (2020) and Morrill et al. (2020) are efficiently hindsight rational for other types of deviations as well. How far can we push this approach toward stronger deviation types?

We present extensive-form regret minimization (EFR), a simple and extensible algorithm that integrates time selection regret minimization (Blum and Mansour 2007) into CFR to minimize the regret of any set of deviations where each deviation can be decomposed into micro-level action transformations. It is generally intractable to run EFR with the full set of such deviations, so we identify four new deviation types that subsume known deviation types—external, causal (Forges and von Stengel 2002; von Stengel and Forges 2008; Dudík and Gordon 2009), action (von Stengel and Forges 2008), and counterfactual (Morrill et al. 2020)—without an exponential increase in computation. We provide EFR instantiations and sublinear regret bounds for each of these new "partial sequence" deviation types.

We present a thorough empirical analysis of EFR's performance with different deviation types in benchmark games from OpenSpiel (Lanctot et al. 2019). As theory suggests, EFR instances with stronger deviation types tend to outperform those with weaker types. Furthermore, EFR with the strongest partial sequence deviation type we describe, twice informed partial sequence, often performs nearly as well as EFR with an unrestricted deviation set, in games where the latter is tractable.

Background

This work will continuously reference decision making from both the macroscopic, normal-form view and the microscopic, extensive-form view. We first describe the normal-form view, which models simultaneous decision making, before extending it with the extensive-form view, which models sequential decision making.
At the macro-scale, players in a game choose strategies that jointly determine the utility for each player. We assume a bounded utility function $u_i : Z \to [-U, U]$ for each player $i$ on a finite set of outcomes, $Z$. Each player has a finite set of pure strategies, $s_i \in S_i$, describing their decision space. A set of results for entirely random events, e.g., die rolls, is denoted $S_c$. A pure strategy profile, $s \in S = S_c \times \bigtimes_{i=1}^N S_i$, is an assignment of pure strategies to each player, and each strategy profile corresponds to an outcome $z \in Z$ determined by the indicator function $P(z; s) \in \{0, 1\}$.

A mixed strategy, $\pi_i \in \Pi_i = \Delta^{|S_i|}$, is a probability distribution over pure strategies. In general, we assume that strategies are mixed, where pure strategies are point masses. The probability of a chance outcome, $s_c \in S_c$, is determined by the "chance player" who plays the fixed strategy $\pi_c$. A mixed strategy profile, $\pi \in \Pi = \{\pi_c\} \times \bigtimes_{i=1}^N \Pi_i$, is an assignment of mixed strategies to each player. The probability of sampling a pure strategy profile, $s$, is the product of sampling each pure strategy individually, i.e., $\pi(s) = \pi_c(s_c) \prod_{i=1}^N \pi_i(s_i)$. The probability of reaching outcome $z$ according to profile $\pi$ is $P(z; \pi) = \mathbb{E}_{s \sim \pi}[P(z; s)]$, allowing us to express player $i$'s expected utility as $u_i(\pi) = \mathbb{E}_{z \sim P(\cdot; \pi)}[u_i(z)]$.

A natural evaluation metric to compare the relative effectiveness of strategies is the difference in their expected utilities. The regret for playing strategy $\pi_i$ instead of an alternative strategy $\pi'_i$ is their difference in expected utility, $u_i(\pi'_i, \pi_{-i}) - u_i(\pi)$. We construct alternative strategies with strategy transformations or swap deviations (Greenwald, Jafari, and Marks 2003), $\phi : S_i \to S_i$, where mixed strategy $\pi_i$ is transformed to the mixed strategy $\phi(\pi_i)$ that assigns probability $[\phi\pi_i](s'_i) = \sum_{s_i \in \phi^{-1}(s'_i)} \pi_i(s_i)$ to each pure strategy $s'_i$, and $\phi^{-1} : s'_i \mapsto \{s_i \,|\, \phi(s_i) = s'_i\}$ is the pre-image of $\phi$. Swap transformation sets are denoted like $\Phi^{\mathrm{SW}}$, where the swap domain will be noted by a subscript. The regret for playing strategy $\pi_i$ instead of applying deviation $\phi$ is then $\rho(\phi; \pi) = u_i(\phi(\pi_i), \pi_{-i}) - u_i(\pi)$. We evaluate a sequence of $T$ strategies, $(\pi_i^t)_{t=1}^T$, against a deviation, $\phi$, with the cumulative regret $\rho^T(\phi) \doteq \sum_{t=1}^T \rho(\phi; \pi^t)$.
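As a concrete illustration of these definitions, the following minimal sketch (our own; the payoff matrix, strategies, and deviation are invented for the example) pushes a mixed strategy through a swap deviation and evaluates $\rho(\phi; \pi)$ in a tiny two-player matrix game.

```python
import numpy as np

# Hypothetical 2x2 payoff matrix for player 0: payoffs[s0, s1] = u_0(s0, s1).
payoffs = np.array([[1.0, -1.0],
                    [-1.0, 2.0]])

def deviation_regret(pi_0, pi_1, phi):
    """rho(phi; pi) = u_0(phi(pi_0), pi_1) - u_0(pi_0, pi_1)."""
    # Push pi_0 forward through phi: [phi pi_0](s') sums pi_0 over the preimage of s'.
    phi_pi_0 = np.zeros_like(pi_0)
    for s, s_prime in enumerate(phi):
        phi_pi_0[s_prime] += pi_0[s]
    u = lambda p: p @ payoffs @ pi_1  # expected utility for player 0
    return u(phi_pi_0) - u(pi_0)

pi_0, pi_1 = np.array([0.5, 0.5]), np.array([0.5, 0.5])
print(deviation_regret(pi_0, pi_1, phi=[1, 1]))  # external deviation to s_0 = 1: 0.25
```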
Hindsight Rationality

Hindsight rationality (Morrill et al. 2020) suggests that multi-agent, general-sum games should be played by algorithms that learn to eliminate their average regret over time. In an online learning setting, a learner repeatedly plays a game with unknown, dynamic, possibly adversarial players. On each round $1 \le t \le T$, the learner who acts as player $i$ chooses a strategy, $\pi^t_i$, simultaneously with the other players, who in aggregate choose $\pi^t_{-i}$. A learner is rational in hindsight with respect to a set of deviations, $\Phi \subseteq \Phi^{\mathrm{SW}}_{S_i}$, if the maximum regret, $R^T(\Phi) = \max_{\phi \in \Phi} \rho^T(\phi)$, is zero. A no-regret or hindsight rational algorithm ensures that the average maximum regret vanishes, i.e., $\lim_{T \to \infty} \frac{1}{T} R^T(\Phi) \le 0$.

The empirical distribution of play, $\mu^T \in \Delta^{|S|}$, is the distribution that summarizes online correlated play, i.e., $\mu^T(s) = \frac{1}{T}\sum_{t=1}^T \pi^t(s)$ for all pure strategy profiles, $s$. The incentive to deviate according to $\phi$ from "mediator recommendations" sampled from distribution $\mu^T$ is the average regret, $\mathbb{E}_{s \sim \mu^T}[\rho(\phi; s)] = \frac{1}{T}\rho^T(\phi)$. Jointly hindsight rational play realizes a mediated equilibrium (Aumann 1974), parameterized by the deviation sets evaluating each player, because this incentive vanishes over time for each player.

The deviation set influences what behaviors are considered rational and the difficulty of ensuring hindsight rationality. E.g., the set of constant transformations are known as the external deviations and are denoted like $\Phi^{\mathrm{EX}}$. Even though these transformations are rudimentary, it is generally intractable to directly minimize regret with respect to external deviations in sequential decision-making settings because the number of pure strategies grows exponentially with the number of decision points.

Extensive-Form Games

An extensive-form game (EFG) models player behavior as a sequence of decisions. Outcomes, here called terminal histories, are constructed incrementally from the empty history, $\emptyset \in H$. At any history $h$, one player, determined by the player function $P : H \setminus Z \to \{1, \ldots, N\} \cup \{c\}$, plays an action, $a \in A(h)$, from a finite set, which advances the current history to $ha$. We write $h \sqsubset ha$ to denote that $h$ is a predecessor of $ha$. We denote the maximum number of actions at any history as $n_A$.

Histories are partitioned into information sets to model imperfect information, e.g., private cards. The player to act at each history $h \in I$ in information set $I \in \mathcal{I}$ must do so knowing only that the current history is in $I$. The unique information set that contains a given history is returned by $\mathbb{I}$ ("blackboard I"), and an arbitrary history of a given information set is returned by $\mathbb{h}$ ("blackboard h"). Naturally, the action sets of each history within an information set must coincide, so we overload $A(I) = A(\mathbb{h}(I))$.

Each player $i$ has their own information partition, denoted $\mathcal{I}_i$. We restrict ourselves to perfect-recall information partitions, which ensure players never forget the information sets they encounter during play, and whose information set transition graphs are forests (not trees, since other players may act first). We write $I \prec I'$ to denote that $I$ is a predecessor of $I'$, $d_{I'} = \sum_{I \prec I'} 1$ to denote the depth of $I'$, and $p(I')$ to reference the parent (immediate predecessor) of $I'$. We use $a \to_{I'} h$ or $a \to_{I'} I$ to reference the unique action required to play from $h \in I$ to a successor history in $I' \succ I$.

Strategies and Reach Probabilities
From the extensive-form view, a pure strategy is an assignment of actions to each of a player's information sets. A natural generalization is to randomize at each information set, leading to the notion of a behavioral strategy (Kuhn 1953). A behavioral strategy is defined by an assignment of immediate strategies, $\pi_i(I) \in \Delta^{|A(I)|}$, to each of player $i$'s information sets, where $\pi_i(a \,|\, I)$ is the probability that $i$ plays action $a$ in $I$. Perfect recall ensures realization equivalence between the sets of mixed and behavioral strategies: there is always a behavioral strategy that applies the same weight to each terminal history as a mixed strategy, and vice versa. Thus, we treat mixed and behavioral strategies (and, by extension, pure strategies) as interchangeable representations.

Since histories are action sequences and behavioral strategies define conditional action probabilities, the probability of reaching a history under a profile is the joint action probability that follows from the chain rule of probability. We overload $P(h; \pi)$ to return the probability of a non-terminal history $h$. Furthermore, we can look at the joint probability of actions played by just one player or a subset of players, denoted, for example, as $P(h; \pi_i)$ or $P(h; \pi_{-i})$. We can use this and perfect recall to define the probability that player $i$ plays to their information set $I \in \mathcal{I}_i$ as $P(\mathbb{h}(I); \pi_i)$. Additionally, we can exclude actions taken before some initial history $h$ to get the probability of playing from $h$ to history $h'$, written as $P(h, h'; \cdot)$, which is $1$ if $h = h'$ and $0$ if $h \not\sqsubseteq h'$.
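The chain-rule factorization of reach probabilities is easy to see in code. Below is a minimal sketch (ours; the information set labels and strategy numbers are hypothetical) that computes $P(h; \pi)$, $P(h; \pi_i)$, and $P(h; \pi_{-i})$ for one history by multiplying the responsible players' action probabilities.

```python
from math import prod

# Behavioral strategies: player -> {(infoset, action): probability}.
strategy = {
    0: {("I0", "l"): 0.4, ("I0", "r"): 0.6},
    1: {("J0", "l"): 0.9, ("J0", "r"): 0.1},
}

# A history as the sequence of (acting player, infoset, action) along one path.
history = [(0, "I0", "r"), (1, "J0", "l")]

def reach(history, players):
    """P(h; pi_players): product of the named players' action probabilities."""
    return prod(strategy[p][(I, a)] for p, I, a in history if p in players)

print(reach(history, {0, 1}))  # P(h; pi)    = 0.6 * 0.9 = 0.54
print(reach(history, {0}))     # P(h; pi_0)  = 0.6
print(reach(history, {1}))     # P(h; pi_-0) = 0.9
```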
We begin our discussion of learning in EFGs by designing an evaluation metric for individual actions. The payoff following an action depends on the actions that will be chosen afterward and the acting player's beliefs, i.e., the probability distribution over histories in the information set. Counterfactual values (Zinkevich et al. 2007b) encapsulate precisely these two components. Given a strategy profile, $\pi$, the counterfactual value for taking $a$ in information set $I \in \mathcal{I}_i$ is the expected utility assuming the other players play according to $\pi_{-i}$ throughout and that player $i$ plays to reach $I$ before playing $\pi_i$ thereafter, i.e.,
$$\sum_{h \in I,\, z \in Z} \underbrace{P(h; \pi_{-i})}_{h\text{ probability}}\, \underbrace{P(ha, z; \pi)\, u_i(z)}_{\text{future value given } ha}.$$
Counterfactual values do not take into account the probability that $\pi_i$ reaches $I$, which is irrelevant to $i$'s beliefs and play after $I$. Counterfactual values provide a method of evaluating actions in isolation from $\pi_i$'s behavior at irrelevant information sets (all $I' \not\succeq I$), via immediate counterfactual regret, which is the extra counterfactual value achieved by choosing a given action instead of following $\pi_i$ at $I$, i.e.,
$$\rho^{\mathrm{CF}}_I(a; \pi) = \sum_{h \in I,\, z \in Z} P(h; \pi_{-i})\, u_i(z)\left(P(ha, z; \pi) - P(h, z; \pi)\right).$$
To modify player $i$'s strategy so that it plays to reach a "target" information set and plays a particular action there is to apply a blind counterfactual deviation (Morrill et al. 2020). The counterfactual value of action $a$ is therefore the expected utility from $I$ of the blind counterfactual deviation that targets $I$ and plays $a$ there. Thus, the full regret for a counterfactual deviation cannot be more than the sum of immediate counterfactual regrets at each of the information sets leading to the target (Morrill et al. 2020).

Notice that there are only $n_A |\mathcal{I}_i|$ counterfactual deviations and immediate counterfactual regrets, and yet any external deviation can be reproduced by applying some combination of counterfactual deviations. Since there are $O(n_A^{|\mathcal{I}_i|})$ external deviations, there is a massive computational advantage in using counterfactual deviations, which exploit the structure of EFGs. The counterfactual regret minimization (CFR) (Zinkevich et al. 2007b) learning algorithm is efficiently hindsight rational for external deviations because CFR works directly with counterfactual deviations and immediate counterfactual regrets.

Iterating on the same approach, Celli et al. (2020) define laminar subtree trigger regret (immediate trigger regret in our terminology) as the regret at $I$ under counterfactual values weighted by the probability that player $i$ plays to a given predecessor and plays a particular action there. Their ICFR modification of pure CFR (Gibson 2014) (i.e., CFR where strategies are purified by sampling actions at each information set) is hindsight rational for informed causal deviations (Forges and von Stengel 2002; von Stengel and Forges 2008; Dudík and Gordon 2009). Morrill et al. (2020) also observe that simply weighting the counterfactual regret at $I$ by the probability that player $i$ plays to $I$ modifies CFR so that it is hindsight rational for blind action deviations.

Footnote: A proper belief would sum to one across histories in an information set, but if the other players do not play in a way that leads to $I$, then the normalization factor is zero. Normalization also nullifies information about changes in how likely the other players play to $I$ over time.

Footnote: "Blind" refers to the fact that all strategy modifications are constant transformations. In contrast, an "informed" deviation modifies strategies conditionally, depending on some portion of the strategy.
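For intuition, here is a small numeric sketch (ours, with made-up numbers) of counterfactual values and immediate counterfactual regret at a single information set: each history in $I$ contributes its opponents' reach probability times the expected utility of each action, and regret compares each action against the current immediate strategy.

```python
import numpy as np

opp_reach = np.array([0.3, 0.1])           # P(h; pi_-i) for the two h in I
action_values = np.array([[1.0, -1.0],     # E[u_i | ha], assuming pi is played after
                          [0.0,  2.0]])
pi_I = np.array([0.5, 0.5])                # current immediate strategy at I

cf_values = opp_reach @ action_values      # counterfactual value of each action a
cf_regret = cf_values - cf_values @ pi_I   # rho^CF_I(a; pi) for each action a
print(cf_values)                           # [0.3, -0.1]
print(cf_regret)                           # [0.2, -0.2]
```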
Behavioral Deviations

A key property shared by counterfactual regret, trigger regret, and reach-probability-weighted counterfactual regret is that the valuations of both the original behavior and the deviation behavior are weighted the same, providing a fair evaluation.
I.e., at a given information set $I$ on round $t$, they are weighted counterfactual regrets, $w^t \rho^{\mathrm{CF}}_I(\cdot\,; \pi^t)$, for some $w^t \ge 0$. The weight $w^t$ can be interpreted as the probability we imagine both player $i$ and the deviation play to $I$.

All of the aforementioned regret concepts set $w^t$ to the deviation reach probability, $P(\mathbb{h}(I); \phi(\pi^t_i))$, and thus imagine that $i$ plays to $I$ according to the deviation behavior, $\phi(\pi^t_i)$. For counterfactual deviations, this means imagining that $i$ plays to $I$ deterministically. For informed causal deviations, this means imagining that $i$ either plays to $I$ according to $\pi^t_i$, or plays to a predecessor of $I$ according to $\pi^t_i$ and plays to $I$ from there according to the probability of the predecessor's "trigger" action. And for blind action deviations, it means imagining that $i$ plays to $I$ according to $\pi^t_i$.

The essential differences between the CFR-like algorithms that use these regret concepts—the computation they require and the types of hindsight rationality they exhibit—are entirely caused by their weighting schemes. Generalizing from these examples, we introduce immediate Φ-regret to capture all hindsight rationality objectives described by behavioral deviations, a natural and broad class of deviations specific to EFGs.

A behavioral deviation is a deviation where immediate strategy modifications are made independently at each information set; such deviations were originally described by Definition 2.2 of von Stengel and Forges (2008). Formally, a behavioral deviation, $\phi$, assigns an action transformation, $\phi_I \in \Phi^{\mathrm{SW}}_{A(I)}$, to each information set $I$, so that the immediate strategy at $I$ becomes $[\phi\pi_i](I) = \phi_I(\pi_i(I))$. Our name for these deviations is meant to draw an analogy to behavioral strategies. We denote the set of all behavioral deviations on a set of information sets $\mathcal{I}' \subseteq \mathcal{I}$ as $\Phi^{\mathrm{SW}}_{\mathcal{I}'}$ by overloading the $\Phi$ notation.

Footnote: Definition 2.2 of von Stengel and Forges (2008) defines extensive-form correlated equilibrium (EFCE) in terms of behavioral deviations, which are generally distinct from the informed causal deviations that typically define EFCE in the artificial intelligence literature. The reason for this discrepancy is that EFCE has come to be associated with the assumption that immediate strategies at unreachable information sets are "uninformative", i.e., $\pi_i(a \,|\, I) = 1$ for some action $a$ for strategy $\pi_i$ if $P(\mathbb{h}(I); \pi_i) = 0$. This assumption is necessary if strategies are in reduced or strategic form, where $\pi_i(I)$ is undefined if $I$ is unreachable. The set of behavioral deviations then collapses to the set of causal deviations (von Stengel and Forges 2008).

How can we define immediate regrets for behavioral deviations? The first step is to establish notation to manipulate the component action transformations within a behavioral deviation and to think about these components as behavioral deviations themselves. So $\phi_I$ refers to behavioral deviation $\phi$'s action transformation at $I$, as we have seen already, but we could also view this as a behavioral deviation that transforms the immediate strategy at $I$ and otherwise leaves the strategy unmodified. We extend this subscript notation to subsets, e.g., $\phi_{\prec I'} = \{\phi_I \,|\, I \prec I'\}$ is the sequence of transformations used from the start of the game to $I'$, or the deviation that applies $\phi$ at each of $I'$'s predecessors and the identity transformation at all other information sets. To make $\phi_{\prec I}$ and $\phi_{\preceq p(I)}$ well defined for information sets at the start of the game, we declare them to be empty sets or the identity deviation when either is applied to a strategy. To combine deviation fragments, we use concatenation, e.g., $\phi = \phi_{\prec I}\, \phi_I\, \phi_{\succ I}$.
We extend the same subscript notation to describe sets of transformation sequences, e.g., $\Phi^{\mathrm{SW}}_{\preceq I'} = \Phi^{\mathrm{SW}}_{\{I \,|\, I \preceq I'\}}$ is the set of behavioral deviations that apply transformations only to each predecessor of $I'$ and to $I'$ itself.

Now, to answer our earlier question, we define immediate Φ-regret as counterfactual regret weighted according to the reach probabilities of transformation sequence $\phi_{\prec I} \in \Phi^{\mathrm{SW}}_{\prec I}$, i.e.,
$$\rho^T_I(\phi_{\preceq I}) = \sum_{t=1}^T P(\mathbb{h}(I); \phi_{\prec I}(\pi^t_i))\, \rho^{\mathrm{CF}}_I(\phi_I; \pi^t),$$
where we generalize counterfactual regret to action transformations as $\rho^{\mathrm{CF}}_I(\phi_I; \pi) = \mathbb{E}_{a \sim \phi_I(\pi_i(I))}[\rho^{\mathrm{CF}}_I(a; \pi)]$. A basic CFR-like algorithm must store and update all immediate regrets at each information set, so how many immediate Φ-regrets are there?

The number of swap transformations at any of $I$'s predecessors is at most $n_A^{n_A}$. Since $I$ has $d_I$ predecessors, the number of swap transformation combinations is $n_A^{d_I n_A}$. Combining each of these with each of the swap transformations at $I$ itself gives us $n_A^{(d_I + 1) n_A}$ immediate Φ-regrets. However, many of these are irrelevant or redundant. A swap transformation, e.g., swapping two actions, $a_1$ and $a_2$, to a third action, $a_3$, is really just the combination of two internal transformations (Foster and Vohra 1999) (denoted like $\Phi^{\mathrm{IN}}$): one that maps $a_1 \to a_3$ and any other action to itself, plus one that maps $a_2 \to a_3$ and any other action to itself. Using such reasoning, one can show that it is sufficient to eliminate internal regret to also eliminate swap regret (Greenwald, Jafari, and Marks 2003), and that it is sufficient to eliminate immediate Φ-regret at $I'$ with respect to transformation sequences that move probability mass from one action to $a \to_{I'} I$ at each predecessor $I \prec I'$ to eliminate immediate Φ-regret there with respect to all behavioral deviations (see Proposition 1 in Appendix A). This leaves at most $n_A^{d_I}$ relevant transformation sequences and $n_A^2 - n_A$ relevant action transformations at $I$. The total number of relevant immediate Φ-regrets is then at most their product, $n_A^{d_I}(n_A^2 - n_A)$.

While our pruning has decreased the number of immediate Φ-regrets substantially, from $O(n_A^{(d_I + 1) n_A})$ to $O(n_A^{d_I + 2})$, it still grows exponentially with depth. We will still provide a hindsight rational algorithm for behavioral deviations, but it will only be tractable in short games. Instead, are there more efficient deviation types that are stronger than previously identified tractable deviation types?
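As a quick sanity check on this counting argument (our own; the values of $n_A$ and $d_I$ are arbitrary), the pruned count is dramatically smaller than the unpruned one:

```python
n_A, d_I = 3, 4
unpruned = n_A ** ((d_I + 1) * n_A)        # all swap sequences: n_A^{(d_I + 1) n_A}
relevant = n_A ** d_I * (n_A ** 2 - n_A)   # pruned count: n_A^{d_I} (n_A^2 - n_A)
print(unpruned, relevant)                  # 14348907 vs. 486
```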
Partial Sequence Deviations

We now introduce four types of partial sequence deviations that are both efficient and powerful. We define a partial sequence deviation as one that modifies an arbitrary-length sequence of actions, instead of a sequence that must begin at the start of the game or terminate at the end of the game. Any such deviation exhibits arbitrary-length correlation, de-correlation, and re-correlation phases. All previously studied tractable deviation types are special cases where the re-correlation phase is absent (causal deviations), where the correlation phase is absent (counterfactual deviations), or where the length of de-correlation is limited to one (action deviations).

The correlation phase is the initial segment of a deviation strategy that leaves the input strategy unmodified until it reaches a "trigger" information set. "Correlation" references the fact that the input strategy can be correlated with the strategies of the other players. The de-correlation phase modifies the input strategy to play to a target information set from the trigger information set, breaking any correlation with the other players. The re-correlation phase follows a de-correlation phase and leaves the input strategy unmodified after play in the target information set, thereby restoring correlation with the other players.

The simplest type of partial sequence deviation is the blind partial sequence (BPS) deviation, where all action transformations in its de-correlation phase are external. There are three alternative versions of informed partial sequence deviations, due to an asymmetry between informed causal and informed counterfactual deviations. A causal partial sequence (CSPS) deviation uses an internal transformation at the trigger information set, while a counterfactual partial sequence (CFPS) deviation uses an internal transformation at the target information set. A twice informed partial sequence (TIPS) deviation uses internal transformations at both positions, making it the strongest of our partial sequence deviation types. See Table C.1 in Appendix C for a formal definition of each partial sequence deviation type and Fig. 1 for an illustration of each type within the EFG deviation landscape.

It is possible to convert more of the external transformations in the de-correlation phase of a partial sequence deviation to internal transformations; however, each one adds an $n_A$ factor to the number of Φ-regrets. Converting all of them results in the full set of behavioral deviations, except that they are defined with internal rather than swap deviations (which are just as powerful, as we saw in Section 3). This reveals how our partial sequence deviations remain efficient: they restrict how many internal transformations—essentially restricting how much information about the input strategy—can be used to generate a deviation strategy. The number of relevant deviations contained within each deviation type is listed in Table 1.

Footnote: The regret matching++ algorithm (Kash, Sullins, and Hofmann 2020) could ostensibly be used to minimize regret with respect to all weightings simultaneously; however, there is an error in the proof of the regret bound. In Appendix D, we give an example where regret matching++ suffers linear regret, and we show that no algorithm can have a sublinear bound on the sum of the positive instantaneous regrets.
Figure 1: A summary of the deviation landscape in EFGs. Within each pictogram, the game plays out from top to bottom. Straight lines are actions, zigzags are action sequences, and triangles are decision trees. Unmodified immediate strategies are colored black, modifications are colored red, and trigger information is colored cyan. Arrows denote superset → subset relationships (and therefore subset → superset equilibrium relationships).

Extensive-Form Regret Minimization

For the rest of the paper, we develop extensive-form regret minimization (EFR), a general and extensible algorithm that is hindsight rational for any given behavioral deviation subset. We show how EFR can be instantiated for our partial sequence deviation types and how its computational requirements and regret bounds change depending on the given deviation subset. We give a general regret bound for any behavioral deviation subset and show the constant factors specific to each partial sequence deviation type. The key innovation of this algorithm is the use of time selection regret minimization (Blum and Mansour 2007) to manage immediate Φ-regrets within the CFR framework.

In a time selection problem (Blum and Mansour 2007), there is a finite set of $M(\phi)$ functions of the round $t$, $W(\phi) = \{t \mapsto w^t_j \in [0, 1]\}_{j=1}^{M(\phi)}$, for each deviation $\phi \in \Phi \subseteq \Phi^{\mathrm{SW}}_{S_i}$. The regret with respect to deviation $\phi$ and time selection function $w \in W(\phi)$ after $T$ rounds is $\rho^T(\phi, w) \doteq \sum_{t=1}^T w^t \rho(\phi; \pi^t)$. The goal is to ensure that each of these regrets grows sublinearly, not only for all deviations but for all deviation–time selection function pairs. This can be accomplished by simply treating each such pair as a separate transformation (here called an expert) and applying a no-regret algorithm.

We introduce a $(\Phi, f)$-regret matching (Hart and Mas-Colell 2000; Greenwald, Li, and Marks 2006) algorithm for the time selection setting with a regret bound that depends on the largest time selection function set, $M^* = \max_{\phi \in \Phi} M(\phi)$.

Table 1: The order of the number of deviations of each deviation type, where $d^* = \max_{I \in \mathcal{I}_i} d_I$ is the depth of player $i$'s deepest information set.

    Type                      $O(\cdot)$
    internal                  $n_A^{2|\mathcal{I}_i|}$
    behavioral                $n_A^{d^* + 2}\, |\mathcal{I}_i|$
    TIPS                      $d^*\, n_A^3\, |\mathcal{I}_i|$
    CSPS                      $d^*\, n_A^2\, |\mathcal{I}_i|$
    CFPS                      $d^*\, n_A^2\, |\mathcal{I}_i|$
    BPS                       $d^*\, n_A\, |\mathcal{I}_i|$
    informed causal           $n_A^{|\mathcal{I}_i| + 1}\, |\mathcal{I}_i|$
    informed action           $n_A^2\, |\mathcal{I}_i|$
    informed counterfactual   $n_A^2\, |\mathcal{I}_i|$
    blind causal              $n_A^{|\mathcal{I}_i|}\, |\mathcal{I}_i|$
    blind action              $n_A\, |\mathcal{I}_i|$
    blind counterfactual      $n_A\, |\mathcal{I}_i|$
    external                  $n_A^{|\mathcal{I}_i|}$

Corollary 1.
Given deviation set $\Phi \subseteq \Phi^{\mathrm{SW}}_{S_i}$ and finite time selection sets $W(\phi) = \{w_j \in [0,1]^T\}_{j=1}^{M(\phi)}$ for each deviation $\phi \in \Phi$, $(\Phi, \cdot^+)$-regret matching chooses a strategy on each round $1 \le t \le T$ as the fixed point of $L^t : \pi_i \mapsto \frac{1}{z^t}\sum_{\phi \in \Phi} \phi(\pi_i)\, y^t_\phi$, or an arbitrary strategy when $z^t = 0$, where link outputs are generated from exact regrets, $y^t_\phi = \sum_{w \in W(\phi)} w^t (\rho^{t-1}(\phi, w))^+$, and $z^t = \sum_{\phi \in \Phi} y^t_\phi$. This algorithm ensures that $\rho^T(\phi, w) \le 2U\sqrt{M^*\, \omega(\Phi)\, T}$ for any deviation $\phi$ and time selection function $w$, where $\omega(\Phi) = \max_{a \in A}\sum_{\phi \in \Phi}\mathbb{1}\{\phi(a) \ne a\}$ is the maximal activation of $\Phi$ (Greenwald, Li, and Marks 2006).

While we only present the bound for the rectified linear unit (ReLU) link function, $\cdot^+ : x \mapsto \max\{0, x\}$, the arguments involved in proving Corollary 1 apply to any link function; only the final bound would change. This result is a consequence of two more general theorems presented in Appendix B: one that allows regret approximations à la D'Orazio et al. (2020) (motivating the use of function approximation) and another that allows predictions of future regret, i.e., optimistic regret matching (D'Orazio and Huang 2021). Appendix B also contains analogous results for the regret matching+ (Tammelin 2014; Tammelin et al. 2015a) modification of regret matching.
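To make the corollary's procedure concrete, here is a minimal Python sketch (ours, not the paper's implementation) of $(\Phi, \cdot^+)$-regret matching with time selection for the internal transformations over $n$ actions at a single decision point. Each internal deviation $a \to b$ is paired with a single hypothetical time selection weight $w^t$, and the fixed point of $L^t$ is computed by power iteration on the induced row-stochastic matrix (assuming convergence, which suffices for this illustration).

```python
import numpy as np

n = 3
cum_regret = np.zeros((n, n))  # cum_regret[a, b]: weighted regret of a -> b

def strategy(w_t):
    """Fixed point of L^t for internal deviations, with link y = w^t (regret)^+."""
    y = w_t * np.maximum(cum_regret, 0.0)
    np.fill_diagonal(y, 0.0)
    z = y.sum()
    if z == 0.0:
        return np.full(n, 1.0 / n)  # arbitrary strategy when z^t = 0
    M = y / z                       # M[a, b]: probability mass moved from a to b
    np.fill_diagonal(M, 1.0 - M.sum(axis=1))  # leftover mass stays on a
    pi = np.full(n, 1.0 / n)
    for _ in range(1000):           # power iteration toward pi = pi M
        pi = pi @ M
    return pi / pi.sum()

# One round: made-up action values v and a reach-probability-like weight w_t.
rng = np.random.default_rng(0)
v, w_t = rng.normal(size=n), 0.7
pi_t = strategy(w_t)
inst = pi_t[:, None] * (v[None, :] - v[:, None])  # rho(phi_{a->b}; pi^t) = pi(a)(v_b - v_a)
cum_regret += w_t * inst
```

With a single all-ones time selection function and external rather than internal deviations, this reduces to ordinary regret matching.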
Let us return to the immediate Φ-regrets described in Section 3. If we choose immediate strategies at information set $I$ according to time selection regret matching, where we set the regret matching deviation set to $\Phi_I$ and the time selection function set to $W_I(\phi) = \{t \mapsto P(\mathbb{h}(I); \phi_{\prec I}(\pi^t_i))\}_{\phi_{\prec I}\phi_I \in \Phi_{\preceq I}}$ for each action transformation $\phi_I \in \Phi_I$, then immediate Φ-regret is bounded at $I$. For the full set of behavioral deviations, it is bounded by $2U\sqrt{T\, n_A^{d_I}(n_A^2 - n_A)}$. By simply applying the same procedure and reasoning to each information set, a bound like this applies to all information sets simultaneously. This procedure outlines our new algorithm, extensive-form regret minimization (EFR). Algorithm 1 provides an implementation of EFR with exact regret matching.

So far we have only shown that EFR would eliminate immediate Φ-regrets, which compare the performance of two strategy sequences generated by partially applying a deviation $\phi$. Hindsight rationality, however, requires a learner to be no-regret with respect to strategies generated by applying $\phi$ completely. Like previous work, we overcome this obstacle by relating immediate regret to full regret.

Algorithm 1: EFR update for player $i$ with exact regret matching.

Input: strategy profile, $\pi^t$, and valid transformation sequences and action transformations, $\{\Phi_{\preceq I} \subseteq \Phi^{\mathrm{SW}}_{\preceq I},\, \Phi_I \subseteq \Phi^{\mathrm{SW}}_{A(I)}\}_{I \in \mathcal{I}_i}$. Let $\rho^0_\cdot(\cdot) = 0$.

Update Φ-regrets:
  for $I \in \mathcal{I}_i$, $\phi_I \in \Phi_I$, $\phi_{\prec I}$ such that $\phi_{\prec I}\phi_I \in \Phi_{\preceq I}$ do
    $\rho^t_I(\phi_{\prec I}\phi_I) \leftarrow \rho^{t-1}_I(\phi_{\prec I}\phi_I) + P(\mathbb{h}(I); \phi_{\prec I}(\pi^t_i))\, \rho^{\mathrm{CF}}_I(\phi_I; \pi^t)$
  end for

Compute $\pi^{t+1}_i$ with regret matching:
  for $I \in \mathcal{I}_i$, from the start of the game to the end, do
    for $\phi_I \in \Phi_I$ do
      // $P(\mathbb{h}(I); \phi_{\prec I}(\pi^{t+1}_i))$ only requires $\pi^{t+1}_i$ to be defined at previous information sets.
      $y^t_{\phi_I} \leftarrow \sum_{\phi_{\prec I} :\, \phi_{\prec I}\phi_I \in \Phi_{\preceq I}} P(\mathbb{h}(I); \phi_{\prec I}(\pi^{t+1}_i))\, (\rho^t_I(\phi_{\prec I}\phi_I))^+$
    end for
    $z^t \leftarrow \sum_{\phi_I \in \Phi_I} y^t_{\phi_I}$
    if $z^t > 0$ then
      $\pi^{t+1}_i(I) \leftarrow$ a fixed point of the linear operator $L^t : \Delta^{|A(I)|} \ni \sigma \mapsto \frac{1}{z^t}\sum_{\phi_I \in \Phi_I} \phi_I(\sigma)\, y^t_{\phi_I}$
    else
      $\pi^{t+1}_i(I)$ is arbitrary, e.g., $[\pi^{t+1}_i(a \,|\, I) \leftarrow \frac{1}{|A(I)|}]_{a \in A(I)}$
    end if
  end for
Output: $\pi^{t+1}_i$.

The full Φ-regret at information set $I$ is the regret between the strategies generated by $\phi$ and those generated by applying $\phi_{\prec I}$, i.e.,
$$\rho_I(\phi; \pi) = \sum_{h \in I,\, h \sqsubset z \in Z} P(z; \phi(\pi_i), \pi_{-i})\, u_i(z) - P(h; \phi_{\prec I}(\pi_i), \pi_{-i})\, P(h, z; \pi)\, u_i(z).$$
See Appendix C for a statement and proof of Lemma 1, which essentially shows that minimizing immediate Φ-regret also minimizes full Φ-regret. Therefore, minimizing immediate Φ-regret at every information set also minimizes full Φ-regret at every information set, including those at the start of the game, which notably lack predecessors. The full Φ-regret at the start of the game is then exactly the total performance difference between $\phi$ and the learner's behavior. Finally, this implies that minimizing immediate Φ-regret ensures hindsight rationality with respect to $\Phi$. EFR's regret is bounded according to the following theorem:

Theorem 1.
Instantiate EFR for player $i$ with exact regret matching and deviations $\Phi$ defined by valid transformation sequences and action transformations, $\{\Phi_{\preceq I} \subseteq \Phi^{\mathrm{SW}}_{\preceq I},\, \Phi_I \subseteq \Phi^{\mathrm{SW}}_{A(I)}\}_{I \in \mathcal{I}_i}$. Let $C_I = \max_{\phi_I \in \Phi_I} \sum_{\phi_{\prec I} \in \Phi_{\preceq p(I)}} \mathbb{1}\{\phi_{\prec I}\phi_I \in \Phi_{\preceq I}\}$ be the maximum number of transformation sequences that are valid for any immediate transformation at any information set after the start of the game, and $C_I = 1$ otherwise. Let $D = \max_I C_I\, \omega(\Phi_I)$. Then EFR's cumulative regret after $T$ rounds with respect to $\Phi$ is upper bounded by $2U|\mathcal{I}_i|\sqrt{DT}$. See Appendix C for technical details.

The variable $D$ in the EFR regret bound, which depends on the particular behavioral deviation subset that EFR is instantiated with, is often the number of immediate Φ-regrets generated by that subset divided by the number of information sets. $D$ is slightly larger for CSPS because CSPS uses the union of internal and external transformations for its action transformation set, $\Phi_I$, at all information sets except those at the beginning of the game (see Appendix C for more details).

Crucially, EFR's generality does not come at a computational cost; EFR reduces to the CFR algorithms previously described to handle counterfactual or action deviations (Zinkevich et al. 2007b; Morrill et al. 2020). For either counterfactual or action deviations, there is only ever one valid transformation sequence that leads to each information set: the sequence of external transformations or the sequence of identity transformations, respectively. With causal partial sequence deviations, EFR also reduces to a version of ICFR. ICFR is pure EFR (analogous to pure CFR) except that the external and internal action transformation learners are sampled and updated independently. EFR therefore improves on this algorithm (beyond its generality) in that its learners share all experience, ostensibly leading to better policies in fewer rounds, and by having a deterministic finite-time regret bound. Furthermore, EFR also inherits CFR's flexibility, as it can be used with Monte Carlo sampling (Lanctot et al. 2009; Burch et al. 2012; Gibson et al. 2012; Johanson et al. 2012), function approximation (Waugh et al. 2015; Morrill 2016; D'Orazio et al. 2020; Brown et al. 2019; Steinberger, Lerer, and Brown 2020; D'Orazio 2020), variance reduction (Schmid et al. 2019; Davis, Schmid, and Bowling 2020), and predictions (Rakhlin and Sridharan 2013; Farina et al. 2019a; D'Orazio and Huang 2021; Farina, Kroer, and Sandholm 2020).

Experiments

Our theoretical results show that EFR variants utilizing more powerful deviation types are pushed to accumulate higher payoffs during learning. However, these results are statements about worst-case performance with respect to the game's structure and the behavior of the other players. Do these deviation types make a practical difference outside of worst-case analysis?

We investigate the online learning performance of EFR with different deviation types in nine benchmark game instances from
OpenSpiel (Lanctot et al. 2019). We evaluate each EFR variant by the expected payoffs accumulated over the course of playing each game (in each player seat) with each other EFR variant over 500 rounds. In games with more than two players, we evaluate an EFR variant by instantiating it as one player and instantiating another EFR variant as the remaining players.

We run each EFR variant in two regimes: one where the other players play the fixed sequence of strategies they used during self-play training, and another where they are learning simultaneously. To elaborate on the fixed regime, we run each EFR variant in self-play and save the sequence of strategy profiles to be replayed during the evaluation of each other EFR variant. Thereby, the fixed regime provides a test of how well each EFR variant adapts to gradually changing but oblivious strategies produced during self-play, and it makes EFR variant comparison simple. The simultaneous regime is a more dynamic (possibly more realistic) setting, but also one where it is more difficult to draw definitive conclusions about the relative effectiveness of each EFR variant.

Since we evaluate expected payoff, use expected EFR updates, and use exact regret matching, all results are deterministic and hyperparameter-free. Experiments were run on a 2.4 GHz Dual-Core Intel Core i5 processor with 16 GB of RAM.

Appendix E hosts the full set of results, but a representative summary from two variants of imperfect information goofspiel (Ross 1971; Lanctot 2013) (a two-player and a three-player version) is presented in Table 2. See Appendix E.1 for a description of each game.

There is a consistent tendency for more powerful deviation types to outperform weaker ones in both the fixed and the simultaneous regime. Behavioral deviations often perform best, and this is true of each scenario presented in Table 2. TIPS or CSPS often performs nearly as well as behavioral deviations, though the gap can be large, as in the fixed regime of two-player goofspiel ($g_{2,5}$). Action deviations (ACT$_{\mathrm{IN}}$) typically perform exceptionally poorly, particularly in two-player, zero-sum games.

A notable outlier is three-player goofspiel with a descending point deck rather than an ascending one. Here, counterfactual or blind partial sequence deviations win often within the first few rounds. In the fixed regime, most of the other variants tend to lose, and in the simultaneous regime, they tend to win slightly less often. However, in both regimes, all variants quickly converge to play that achieves essentially the same payoff (see Figures E.1-E.4 in Appendix E).

Table 2: The win percentage of each EFR instance averaged across both rounds and each instance pairing (eight pairs in total) in two versions of goofspiel ($g_{2,5}$ → goofspiel$(5, \uparrow, N = 2)$ and $g_{3,4}$ → goofspiel$(4, \uparrow, N = 3)$). The top group of algorithms use weak deviation types (ACT$_{\mathrm{IN}}$ → informed action deviations, CF → blind counterfactual, and CF$_{\mathrm{IN}}$ → informed counterfactual), and the middle group use partial sequence deviation types. The BHV instance uses the full set of behavioral deviations.

                 fixed                 simultaneous
             g_{2,5}   g_{3,4}     g_{2,5}   g_{3,4} †
    ACT_IN     52        46          48        87
    CF         56        50          51        84
    CF_IN      57        50          52        91
    BPS        58        51          52        85
    CFPS       59        51          52        83
    CSPS       60        51          52        91
    TIPS       60        51          53        88
    BHV        65        51          53        93

† Two important aspects of three-player goofspiel are that players who tend to play the same actions tend to perform worse, and the game is symmetric across player seats. Thus, if two players use the same algorithm, they will always employ the same strategies and often play the same actions. If the third player uses a different algorithm, they have a substantial advantage. Since we only evaluate EFR variants that are instantiated as a single player in the simultaneous regime, the win percentage for all variants in this regime tends to be high, although the relative comparison is still informative.

Conclusion

We introduced EFR, an algorithm that is hindsight rational for any given set of behavioral deviations $\Phi \subseteq \Phi^{\mathrm{SW}}_{\mathcal{I}_i}$. We achieved this by formulating time selection regret minimization problems at each information set based on immediate Φ-regret, and by showing that hindsight rationality with respect to $\Phi$ is achieved when each immediate Φ-regret is minimized. While the full set of behavioral deviations leads to generally intractable computational requirements, we identified four partial sequence deviation types that are both tractable and powerful in games with moderate lengths. In our experiments testing online play in benchmark games, EFR with partial sequence deviation types tended to outperform EFR with weaker deviation types, and EFR with TIPS deviations often performed nearly as well as EFR with all behavioral deviations, in games where the latter is tractable.
Dustin Morrill and Michael Bowling are supported by the Alberta Machine Intelligence Institute (Amii) and NSERC. James Wright is supported by a Canada CIFAR AI Chair at Amii. Amy Greenwald is supported in part by NSF Award CMMI-1761546. Thanks to Ian Gemp for constructive comments and suggestions.
A Swap vs. Internal Transformations at $I \prec I'$

Here we show why it is not fundamentally more expressive to consider deviations that utilize swap transformations instead of internal transformations at predecessor information sets. We use $a \to_{I'} h$ or $a \to_{I'} I$ to reference the unique action required to play from $h \in I$ to a successor history in $I' \succ I$.

Footnote: This definition was mistakenly omitted from the main paper and will be included in future versions.

Proposition 1.
Given a pair of information sets, $I \prec I'$, and a sequence of strategy profiles, $(\pi^t)_{t=1}^T$, the following relationship holds for any action transformation assignment, $\phi_I \in \Phi^{\mathrm{SW}}_{A(I)}$, $\phi_{I'} \in \Phi^{\mathrm{SW}}_{A(I')}$:
$$\sum_{t=1}^T [\phi_I \pi^t_i](a \to_{I'} I \,|\, I)\, \rho^{\mathrm{CF}}_{I'}(\phi_{I'}; \pi^t) = \sum_{a \in \phi_I^{-1}(a \to_{I'} I)} \sum_{t=1}^T \pi^t_i(a \,|\, I)\, \rho^{\mathrm{CF}}_{I'}(\phi_{I'}; \pi^t).$$
That is, the immediate Φ-regret of deviation $\phi_{I'}$ following swap transformation $\phi_I$ is equal to the immediate counterfactual regret of $\phi_{I'}$ weighted by the internal $a \to (a \to_{I'} I)$ transformation weights, $[\pi^t_i(a \,|\, I)]_{a \in \phi_I^{-1}(a \to_{I'} I)}$.

Proof.
$$\sum_{t=1}^T [\phi_I \pi^t_i](a \to_{I'} I \,|\, I)\, \rho^{\mathrm{CF}}_{I'}(\phi_{I'}; \pi^t) = \sum_{t=1}^T \rho^{\mathrm{CF}}_{I'}(\phi_{I'}; \pi^t) \sum_{a \in \phi_I^{-1}(a \to_{I'} I)} \pi^t_i(a \,|\, I) \quad (1)$$
$$= \sum_{a \in \phi_I^{-1}(a \to_{I'} I)} \sum_{t=1}^T \rho^{\mathrm{CF}}_{I'}(\phi_{I'}; \pi^t)\, \pi^t_i(a \,|\, I). \quad (2)$$

E.g., if there are three actions at $I$, $\{a_1, a_2, a \to_{I'} I\}$, then the weighted regret for swapping both $a_1$ and $a_2$ to $a \to_{I'} I$ is equal to the sum of the regrets weighted by individually swapping $a_1$ and $a_2$, i.e.,
$$\sum_{t=1}^T \left(\pi^t_i(a_1 \,|\, I) + \pi^t_i(a_2 \,|\, I)\right) \rho^{\mathrm{CF}}_{I'}(\phi_{I'}; \pi^t) = \sum_{t=1}^T \pi^t_i(a_1 \,|\, I)\, \rho^{\mathrm{CF}}_{I'}(\phi_{I'}; \pi^t) + \pi^t_i(a_2 \,|\, I)\, \rho^{\mathrm{CF}}_{I'}(\phi_{I'}; \pi^t). \quad (3)$$

B Regret Matching for Time Selection
In an online decision problem (also called a prediction with expert advice problem), regret matching is a learning algorithm that accumulates a vector of regrets, $\rho^{t-1}$—one for each deviation or "expert", $\phi \in \Phi \subseteq \Phi^{\mathrm{SW}}_{S_i}$—and chooses its mixed strategy, $\pi^t$, on each round as the fixed point of a linear operator. We generalize this algorithm and three extensions—regret matching+, regret approximation, and predictions—to the time selection setting.

B.1 Background

Regret Matching
The regret matching operator is constructed from a vector of non-negative link outputs, $y^t \in \mathbb{R}^{|\Phi|}_+$, generated by applying a link function, $f : \mathbb{R}^{|\Phi|} \to \mathbb{R}^{|\Phi|}_+$, to the cumulative regrets, i.e., $y^t = f(\rho^{t-1})$. The operator is defined as
$$L^t : \pi_i \mapsto \frac{1}{z^t} \sum_{\phi \in \Phi} \phi(\pi_i)\, y^t_\phi, \quad (4)$$
where $z^t = \sum_{\phi \in \Phi} y^t_\phi$ is the sum of the link outputs, and $\pi^t_i$ is chosen arbitrarily if $z^t = 0$.

Regret bounds are generally derived for regret matching algorithms by choosing $f = \alpha g$ for some $\alpha > 0$, where $g$ is part of a Gordon triple (Gordon 2005), $(G, g, \gamma)$. A Gordon triple consists of a potential function, $G : \mathbb{R}^n \to \mathbb{R}$, a scaled link function, $g : \mathbb{R}^n \to \mathbb{R}^n_+$, and a size function, $\gamma : \mathbb{R}^n \to \mathbb{R}_+$, that together satisfy the generalized smoothness condition $G(x + x') \le G(x) + x' \cdot g(x) + \gamma(x')$ for any $x, x' \in \mathbb{R}^n$. By applying the potential function to the cumulative regret, we can unroll the recursive bound to get a simple bound on the cumulative regret itself.

While its bounds are not quite optimal, Hart and Mas-Colell (2000)'s original regret matching algorithm, defined with the rectified linear unit (ReLU) link function, $\cdot^+ : x \mapsto \max\{x, 0\}$, is often exceptionally effective in practice (see, e.g., Waugh and Bagnell (2015); Burch (2017)). We focus our analysis on this link function, but our arguments readily apply to other link functions; only the final regret bounds will change. We follow the typical convention for analyzing Hart and Mas-Colell (2000)'s regret matching, with $\gamma(x) = \frac{1}{2}\|x\|_2^2$, $G(x) = \gamma(x^+)$, and $g = f$.

Regret Matching+

Instead of the cumulative regrets, regret matching+ updates a vector of pseudo regrets (sometimes called "q-regrets"), $q^t = (q^{t-1} + \rho^t)^+ \ge \rho^t$ (Tammelin 2014; Tammelin et al. 2015b). If we assume a positive invariant potential function, where $G((x + x')^+) \le G(x + x')$, then the same regret bounds follow from the same arguments used in the analysis of regret matching (D'Orazio 2020). Note that this condition is satisfied with equality for the quadratic potential $G(x) = \frac{1}{2}\|x^+\|_2^2$.
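A tiny sketch (ours) of the two update rules for the special case of external deviations, where the fixed point of the operator reduces to normalizing the positive parts of the (pseudo-)regrets:

```python
import numpy as np

def rm_strategy(cum_regret):
    """Regret matching: normalize the ReLU of cumulative regrets."""
    y = np.maximum(cum_regret, 0.0)
    z = y.sum()
    return y / z if z > 0 else np.full(len(cum_regret), 1.0 / len(cum_regret))

def rm_plus_update(q, inst_regret):
    """Regret matching+: q^t = (q^{t-1} + rho^t)^+."""
    return np.maximum(q + inst_regret, 0.0)

q = np.zeros(3)
for v in ([1.0, 0.0, -1.0], [0.0, 2.0, 0.0]):  # made-up action values
    pi = rm_strategy(q)
    inst = np.asarray(v) - np.asarray(v) @ pi   # instantaneous regrets
    q = rm_plus_update(q, inst)
print(rm_strategy(q))
```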
Regret Approximation

Approximate regret matching is regret matching with approximated cumulative regrets, $\tilde{\rho}^{t-1} \approx \rho^{t-1}$ (Waugh et al. 2015; D'Orazio et al. 2020), or q-regrets, $\tilde{q}^{t-1} \approx q^{t-1}$ (Morrill 2016; D'Orazio 2020). The regret of approximate regret matching depends on its approximation accuracy, which motivates the use of function approximation when it is impractical to store and update the regret for each deviation individually. While it requires an extra assumption, we derive simpler approximate regret matching bounds than those derived by D'Orazio et al. (2020) and D'Orazio (2020) through an analysis of regret matching with predictions.

Optimism via Predictions
Optimistic regret matching augments its link inputs by adding a prediction of the instantaneous regret on the next round, i.e., $m^t \approx \rho^t$. If the predictions are accurate, then the algorithm's cumulative regret will be very small. This is a direct application of optimistic Lagrangian Hedging (D'Orazio and Huang 2021) to Φ-regret. The general approach of adding predictions to improve the performance of regret minimizers originates with Rakhlin and Sridharan (2013) and Syrgkanis et al. (2015).

D'Orazio and Huang (2021)'s analysis requires that $G$ and $g$ satisfy $G(x') \ge G(x) + \langle g(x), x' - x \rangle$, which is achieved, for example, if $G$ is convex and $g$ is a subgradient of $G$. Note that this is achieved for Hart and Mas-Colell (2000)'s regret matching because Greenwald, Li, and Marks (2006) show that the ReLU function is the gradient of the convex quadratic potential $G(x) = \frac{1}{2}\|x^+\|_2^2$.
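As a sketch of the prediction mechanism (ours; predicting that the previous instantaneous regret repeats is a common heuristic, not a prescription from the paper), the only change to regret matching is adding $m^t$ to the link inputs:

```python
import numpy as np

def optimistic_rm_strategy(cum_regret, m_t):
    """Regret matching on predicted totals: normalize (rho^{t-1} + m^t)^+."""
    y = np.maximum(cum_regret + m_t, 0.0)
    z = y.sum()
    return y / z if z > 0 else np.full(len(cum_regret), 1.0 / len(cum_regret))

cum = np.array([0.5, -0.2, 0.1])
last_inst = np.array([0.3, 0.0, -0.1])  # m^t: predict the last regret repeats
print(optimistic_rm_strategy(cum, last_inst))
```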
B.2 Time Selection

To adapt regret matching to the time selection framework, we treat each deviation–time selection function pair as a separate expert and sum over the link outputs corresponding to a given deviation to construct the regret matching operator. Our goal is then to ensure that each element of the cumulative regret matrix, $\rho^T$, grows sublinearly, where each index in the second dimension corresponds to a time selection function. Each deviation $\phi \in \Phi$ is assigned a finite set of time selection functions, $w \in W(\phi)$, so the regret matrix entries corresponding to $(\phi, w)$-pairings where $w \notin W(\phi)$ are always zero.

To facilitate a unified analysis, we assume a general optimistic regret matching algorithm that, after $t - 1$ rounds, uses link outputs $y^t_\phi = \sum_{w \in W(\phi)} w^t (x^t_{\phi,w} + m^t_{\phi,w})^+$, where either $x^t = \rho^{t-1}$ or $x^t = q^{t-1}$ with $x^1 = 0$, and $m^t$ is a matrix of arbitrary predictions or approximation errors. Notice that this means that $x^t + m^t$ can be generated from a function approximator instead of storing either term in a table. Denoting the weighted sum of the link outputs as $z^t = \sum_{\phi \in \Phi} y^t_\phi$, the regret matching operator has the same form as initially defined, i.e.,
$$L^t : \pi_i \mapsto \frac{1}{z^t} \sum_{\phi \in \Phi} \phi(\pi_i)\, y^t_\phi. \quad (5)$$
With this, we can bound the regret of optimistic regret matching thusly:

Theorem 2.
Establish deviation set $\Phi \subseteq \Phi^{\mathrm{SW}}_{S_i}$ and finite time selection sets $W(\phi) = \{w \in [0,1]^T\}_{j=1}^{M(\phi)}$ for each deviation $\phi \in \Phi$. On each round $1 \le t \le T$, $(\Phi, \cdot^+)$-regret matching with respect to matrix $x^t$ (equal to either $\rho^{t-1}$ or $q^{t-1}$) and predictions $m^t$ chooses its strategy, $\pi^t_i \in \Pi_i$, to be the fixed point of $L^t : \pi_i \mapsto \frac{1}{z^t}\sum_{\phi \in \Phi} \phi(\pi_i)\, y^t_\phi$, or an arbitrary strategy when $z^t = 0$, where link outputs are generated from $y^t_\phi = \sum_{w \in W(\phi)} w^t (x^t_{\phi,w} + m^t_{\phi,w})^+$ and $z^t = \sum_{\phi \in \Phi} y^t_\phi$. This algorithm ensures that
$$\rho^T(\phi, w) \le \sqrt{\sum_{t=1}^T \sum_{\phi' \in \Phi,\, \bar{w} \in W(\phi')} \left(\bar{w}^t \rho(\phi'; \pi^t) - m^t_{\phi',\bar{w}}\right)^2}$$
for every deviation $\phi$ and time selection function $w$.

Proof. Let us overload $W = \bigcup_{\phi \in \Phi} W(\phi)$ and let $a_{\cdot,w} = [a_{\phi,w}]_{\phi \in \Phi}$ for any matrix $a \in \mathbb{R}^{|\Phi| \times |W|}$. Then, for any time selection function, $w \in W$, the quadratic potential function, $G(x) = \frac{1}{2}\|x^+\|_2^2$, is convex, positive invariant (with equality), has the ReLU function as its gradient (Greenwald, Li, and Marks 2006), and is smooth with respect to $\gamma(x) = \frac{1}{2}\|x\|_2^2$. Altogether, these properties imply that
$$G\left(\left(x^t_{\cdot,w} + w^t \rho^t\right)^+\right) = G\left(\left(x^t_{\cdot,w} + m^t_{\cdot,w} + w^t \rho^t - m^t_{\cdot,w}\right)^+\right) \quad (6)$$
$$= G\left(x^t_{\cdot,w} + m^t_{\cdot,w} + w^t \rho^t - m^t_{\cdot,w}\right) \quad (7)$$
$$\le G\left(x^t_{\cdot,w} + m^t_{\cdot,w}\right) + \left\langle w^t \rho^t - m^t_{\cdot,w},\, \left(x^t_{\cdot,w} + m^t_{\cdot,w}\right)^+\right\rangle + \gamma\left(w^t \rho^t - m^t_{\cdot,w}\right), \quad (8)$$
where $\rho^t = [\rho(\phi; \pi^t)]_{\phi \in \Phi}$ is the vector of instantaneous regrets on round $t$.

By convexity, $G(a) - G(b) \le \langle \nabla G(a),\, a - b \rangle$ for any vectors $a$ and $b$, so we substitute $a = x^t_{\cdot,w} + m^t_{\cdot,w}$ and $b = x^t_{\cdot,w}$ to bound $G(x^t_{\cdot,w} + m^t_{\cdot,w}) - \langle m^t_{\cdot,w}, (x^t_{\cdot,w} + m^t_{\cdot,w})^+ \rangle \le G(x^t_{\cdot,w})$. Therefore,
$$G\left(\left(x^t_{\cdot,w} + w^t \rho^t\right)^+\right) \le G\left(x^t_{\cdot,w}\right) + w^t \left\langle \rho^t,\, \left(x^t_{\cdot,w} + m^t_{\cdot,w}\right)^+\right\rangle + \gamma\left(w^t \rho^t - m^t_{\cdot,w}\right) \quad (9)$$
$$= G\left(\left(x^t_{\cdot,w}\right)^+\right) + w^t \left\langle \rho^t,\, \left(x^t_{\cdot,w} + m^t_{\cdot,w}\right)^+\right\rangle + \gamma\left(w^t \rho^t - m^t_{\cdot,w}\right). \quad (10)$$

Summing the potentials across time selection functions,
$$\sum_{w \in W} G\left(\left(x^t_{\cdot,w} + w^t \rho^t\right)^+\right) \le \sum_{w \in W} G\left(\left(x^t_{\cdot,w}\right)^+\right) + w^t \left\langle \rho^t,\, \left(x^t_{\cdot,w} + m^t_{\cdot,w}\right)^+\right\rangle + \gamma\left(w^t \rho^t - m^t_{\cdot,w}\right). \quad (11)$$

With some algebra, we can rewrite the sum of inner products:
$$\sum_{w \in W} w^t \left\langle \rho^t,\, \left(x^t_{\cdot,w} + m^t_{\cdot,w}\right)^+\right\rangle = \sum_{w \in W} \sum_{\phi \in \Phi} w^t \rho(\phi; \pi^t) \left(x^t_{\phi,w} + m^t_{\phi,w}\right)^+ \quad (12)$$
$$= \sum_{\phi \in \Phi} \rho(\phi; \pi^t) \sum_{w \in W(\phi)} w^t \left(x^t_{\phi,w} + m^t_{\phi,w}\right)^+ \quad (13)$$
$$= \sum_{\phi \in \Phi} \rho(\phi; \pi^t)\, y^t_\phi \quad (14)$$
$$= \langle \rho^t, y^t \rangle. \quad (15)$$

Since the strategy $\pi^t_i$ is the fixed point of $L^t$ generated from link outputs $y^t$, the Blackwell condition $\langle \rho^t, y^t \rangle \le 0$ is satisfied with equality. For proof, see, for example, Greenwald, Li, and Marks (2006). The sum of potential functions after $T$ rounds is then bounded as
$$\sum_{w \in W} G\left(\left(x^T_{\cdot,w} + w^T \rho^T\right)^+\right) \le \sum_{w \in W} G\left(\left(x^T_{\cdot,w}\right)^+\right) + \gamma\left(w^T \rho^T - m^T_{\cdot,w}\right). \quad (16)$$

Expanding the definition of $\gamma$,
$$\sum_{w \in W} G\left(\left(x^T_{\cdot,w} + w^T \rho^T\right)^+\right) \le \sum_{w \in W} G\left(\left(x^T_{\cdot,w}\right)^+\right) + \frac{1}{2}\sum_{w \in W}\sum_{\phi \in \Phi}\left(w^T \rho(\phi; \pi^T) - m^T_{\phi,w}\right)^2 \quad (17)$$
$$= \sum_{w \in W} G\left(\left(x^T_{\cdot,w}\right)^+\right) + \frac{1}{2}\sum_{\phi \in \Phi,\, w \in W(\phi)}\left(w^T \rho(\phi; \pi^T) - m^T_{\phi,w}\right)^2. \quad (18)$$

Unrolling the recursion across time,
$$\sum_{w \in W} G\left(\left(x^{T+1}_{\cdot,w}\right)^+\right) \le \frac{1}{2}\sum_{t=1}^T \sum_{\phi \in \Phi,\, w \in W(\phi)}\left(w^t \rho(\phi; \pi^t) - m^t_{\phi,w}\right)^2. \quad (19)$$

We lower bound
$$\sum_{w \in W} G\left(\left(x^{T+1}_{\cdot,w}\right)^+\right) = \frac{1}{2}\sum_{w \in W}\sum_{\phi \in \Phi}\left(\left(x^{T+1}_{\phi,w}\right)^+\right)^2 \quad (20)$$
$$\ge \frac{1}{2}\max_{\phi \in \Phi,\, w \in W(\phi)}\left(\left(x^{T+1}_{\phi,w}\right)^+\right)^2 \quad (21)$$
so that
$$\frac{1}{2}\max_{\phi \in \Phi,\, w \in W(\phi)}\left(\left(x^{T+1}_{\phi,w}\right)^+\right)^2 \le \frac{1}{2}\sum_{t=1}^T \sum_{\phi \in \Phi,\, w \in W(\phi)}\left(w^t \rho(\phi; \pi^t) - m^t_{\phi,w}\right)^2. \quad (22)$$

Multiplying both sides by two, taking the square root, and applying $\rho^T(\phi, w) \le (x^{T+1}_{\phi,w})^+$, we arrive at the final bound,
$$\max_{\phi \in \Phi,\, w \in W(\phi)} \rho^T(\phi, w) \le \sqrt{\sum_{t=1}^T \sum_{\phi' \in \Phi,\, \bar{w} \in W(\phi')}\left(\bar{w}^t \rho(\phi'; \pi^t) - m^t_{\phi',\bar{w}}\right)^2}. \quad (23)$$

Since the bound is true of the worst-case $\phi \in \Phi$ and $w \in W$, it is true of each pair, thereby proving the claim.

If all of the predictions $m^t$ are zero, then we arrive at a simple bound for exact regret matching. We only prove the bound for ordinary regret matching for simplicity, but the result and arguments are identical for exact regret matching+.

Corollary 1.
Given deviation set $\Phi \subseteq \Phi^{\mathrm{SW}}_{S_i}$ and finite time selection sets $W(\phi) = \{w_j \in [0,1]^T\}_{j=1}^{M(\phi)}$ for each deviation $\phi \in \Phi$, $(\Phi, \cdot^+)$-regret matching chooses a strategy on each round $1 \le t \le T$ as the fixed point of $L^t : \pi_i \mapsto \frac{1}{z^t}\sum_{\phi \in \Phi}\phi(\pi_i)\, y^t_\phi$, or an arbitrary strategy when $z^t = 0$, where link outputs are generated from exact regrets, $y^t_\phi = \sum_{w \in W(\phi)} w^t (\rho^{t-1}(\phi, w))^+$, and $z^t = \sum_{\phi \in \Phi} y^t_\phi$. This algorithm ensures that $\rho^T(\phi, w) \le 2U\sqrt{M^*\, \omega(\Phi)\, T}$ for any deviation $\phi$ and time selection function $w$, where $\omega(\Phi) = \max_{a \in A}\sum_{\phi \in \Phi}\mathbb{1}\{\phi(a) \ne a\}$ is the maximal activation of $\Phi$ (Greenwald, Li, and Marks 2006).

Proof. Since $m^t = 0$ on every round $t$, we know from Theorem 2 that
$$\rho^T(\phi, w) \le \sqrt{\sum_{t=1}^T \sum_{\phi' \in \Phi,\, \bar{w} \in W(\phi')}\left(\bar{w}^t \rho(\phi'; \pi^t)\right)^2} \quad (24)$$
$$= \sqrt{\sum_{t=1}^T \sum_{\phi' \in \Phi}\left(\rho(\phi'; \pi^t)\right)^2 \sum_{\bar{w} \in W(\phi')}\left(\bar{w}^t\right)^2}. \quad (25)$$
Since $0 \le \bar{w}^t \le 1$,
$$\rho^T(\phi, w) \le \sqrt{M^* \sum_{t=1}^T \sum_{\phi' \in \Phi}\left(\rho(\phi'; \pi^t)\right)^2}. \quad (26)$$
Since $\sum_{\phi' \in \Phi}(\rho(\phi'; \pi^t))^2 \le (2U)^2\, \omega(\Phi)$ (see Greenwald, Li, and Marks (2006)),
$$\rho^T(\phi, w) \le \sqrt{M^* (2U)^2\, \omega(\Phi)\, T} \quad (27)$$
$$= 2U\sqrt{M^*\, \omega(\Phi)\, T}. \quad (28)$$
This completes the argument.

If $x^t + m^t$ is generated from a function attempting to approximate $x^t + [w^t \rho(\phi; \pi^t)]_{\phi \in \Phi,\, w \in W}$, then we can rewrite Theorem 2 in terms of its approximation error.

Corollary 2.
Establish deviation set $\Phi \subseteq \Phi^{\mathrm{SW}}_{S_i}$ and finite time selection sets $W(\phi) = \{w \in [0,1]^T\}_{j=1}^{M(\phi)}$ for each deviation $\phi \in \Phi$. On each round $1 \le t \le T$, approximate $(\Phi, \cdot^+)$-regret matching with respect to matrix $x^t$ (equal to either $\rho^{t-1}$ or $q^{t-1}$) chooses its strategy, $\pi^t_i \in \Pi_i$, to be the fixed point of $L^t : \pi_i \mapsto \frac{1}{z^t}\sum_{\phi \in \Phi}\phi(\pi_i)\, y^t_\phi$, or an arbitrary strategy when $z^t = 0$, where link outputs are generated from an approximation matrix $\tilde{y}^t \in \mathbb{R}^{|\Phi| \times |W|}$ as $y^t_\phi = \sum_{w \in W(\phi)} w^t (\tilde{y}^t_{\phi,w})^+$ and $z^t = \sum_{\phi \in \Phi} y^t_\phi$. This algorithm ensures that
$$\rho^T(\phi, w) \le \sqrt{\sum_{t=1}^T \sum_{\phi' \in \Phi,\, \bar{w} \in W(\phi')}\left(x^t_{\phi',\bar{w}} + \bar{w}^t \rho(\phi'; \pi^t) - \tilde{y}^t_{\phi',\bar{w}}\right)^2}$$
for every deviation $\phi$ and time selection function $w$.

Proof. Since the predictions $m^t$ are arbitrary, we can set $\tilde{y}^t = x^t + m^t$, which implies that $m^t = \tilde{y}^t - x^t$. Substituting this into the bound of Theorem 2, we arrive at the desired result.

C EFR
C EFR

EFR's regret decomposition is a straightforward generalization of CFR's by Zinkevich et al. (2007a). A few preliminary definitions are required before stating the results.

The action taken to reach a given information set from its parent is returned by $\mathbb{a} : I' \mapsto a^{\to I'}_{p(I')}$ ("blackboard a"). Let the child information sets of information set $I$ after taking action $a$ be

$$\mathcal{I}_i(I, a) = \left\{ I' \in \mathcal{I}_i \,\middle|\, \forall h' \in I',\ \exists h \in I,\ h' \sqsupseteq ha,\ \nexists h'' \in \mathcal{H}_i,\ ha \sqsubseteq h'' \sqsubset h' \right\}. \tag{30}$$

Let the histories that terminate without further input from player $i$ after taking action $a$ in $I$ be

$$\mathcal{Z}_i(I, a) = \left\{ z \in \mathcal{Z} \,\middle|\, \exists h \in I,\ z \sqsupseteq ha,\ \nexists h' \in \mathcal{H}_i,\ ha \sqsubseteq h' \sqsubset z \right\}. \tag{31}$$

We define a generalized counterfactual value function that is convenient for working with behavioral deviations as

$$v_I : \pi_i; \pi_{-i} \mapsto \sum_{a\in\mathcal{A}(I)} \pi_i(a \mid I) \sum_{h\in I,\, z\in\mathcal{Z}} P(h; \pi_{-i})\, P(ha, z; \pi_i, \pi_{-i})\, u_i(z).$$

Then, given that $\phi^a_I \in \Phi^{\mathrm{EX}}_{\mathcal{A}(I)}$ is the external transformation to action $a$, the conventional counterfactual value of action $a$ at $I$ is $v_I(\phi^a_I(\pi_i); \pi_{-i}) = \sum_{h\in I,\, z\in\mathcal{Z}} P(h; \pi_{-i})\, P(ha, z; \pi)\, u_i(z)$. By splitting the histories that lead out of $I$ into those that terminate without further input from $i$ and those that lead to child information sets, we can decompose counterfactual values recursively:

$$v_I(\phi^a_I(\pi_i); \pi_{-i}) = \sum_{h\in I,\, z\in\mathcal{Z}} P(h; \pi_{-i})\, P(ha, z; \pi)\, u_i(z) \tag{32}$$
$$= \underbrace{\sum_{h\in I,\, z\in\mathcal{Z}} P(ha, z; \pi_i)\, P(z; \pi_{-i})\, u_i(z)}_{\text{Terminal counterfactual values.}} \tag{33}$$
$$= \underbrace{\sum_{z\in\mathcal{Z}_i(I,a)} P(z; \pi_{-i})\, u_i(z)}_{\text{Expected value from terminal histories.}} + \underbrace{\sum_{\substack{h'\in I'\in\mathcal{I}_i(I,a) \\ z\in\mathcal{Z}}} P(h', z; \pi_i)\, P(z; \pi_{-i})\, u_i(z)}_{\text{Expected value from non-terminal histories.}}. \tag{34}$$

If we define $r(I, a; \pi_{-i}) = \sum_{z\in\mathcal{Z}_i(I,a)} P(z; \pi_{-i})\, u_i(z)$, then

$$= r(I, a; \pi_{-i}) + \sum_{I'\in\mathcal{I}_i(I,a)} \underbrace{\sum_{a'\in\mathcal{A}(I')} \pi_i(a' \mid I') \sum_{\substack{h'\in I' \\ z\in\mathcal{Z}}} P(h'a', z; \pi_i)\, P(z; \pi_{-i})\, u_i(z)}_{v_{I'}(\pi_i;\, \pi_{-i})} \tag{35}$$
$$= \underbrace{r(I, a; \pi_{-i})}_{\text{Expected immediate value.}} + \underbrace{\sum_{I'\in\mathcal{I}_i(I,a)} v_{I'}(\pi_i; \pi_{-i})}_{\text{Expected future value.}}. \tag{36}$$
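The recursion in Eq. (36) is what makes these values cheap to compute in a single bottom-up pass over the information sets. A minimal sketch of that pass follows; the `node` interface is hypothetical, and it assumes the immediate values $r(I, a; \pi_{-i})$ have already been aggregated from opponent reach probabilities and payoffs.

```python
def counterfactual_value(node, policy):
    """Evaluate v_I(pi_i; pi_{-i}) via the recursion of Eq. (36).

    `node` is a hypothetical information-set object exposing:
      node.actions     -- the legal actions at I
      node.r(a)        -- r(I, a; pi_{-i}): expected value from histories
                          that terminate without further input from
                          player i after action a
      node.children(a) -- the child information sets I_i(I, a)
    `policy(node, a)`  -- pi_i(a | I)
    """
    return sum(
        policy(node, a)
        * (node.r(a)
           + sum(counterfactual_value(child, policy)
                 for child in node.children(a)))
        for a in node.actions)
```

Fixing the strategy at $I$ to play $a$ (the external transformation $\phi^a_I$) simply replaces the outer policy weights with an indicator on $a$, recovering the conventional counterfactual value decomposed in Eq. (36).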
Full $\Phi$-regret is the incentive to continue a deviation instead of re-correlating upon reaching information set $I$. More precisely, it is the regret from $I$ for deviating to reach $I$ and then following the unmodified strategy to the end of the game, instead of deviating for the entire game. Critically, at the beginning of the game, full $\Phi$-regret is exactly the total benefit of the deviation compared to the learner's unmodified decisions. Thus, hindsight rationality is achieved as long as full $\Phi$-regret grows sublinearly at every information set. If each action transformation $\phi_I \in \phi$ is an external transformation, then full $\Phi$-regret reduces exactly to full counterfactual regret. In this way, full $\Phi$-regret generalizes full counterfactual regret to behavioral deviations in the same way that normal-form $\Phi$-regret (Greenwald, Jafari, and Marks 2003) generalizes external regret.

Formally, the full $\Phi$-regret of behavioral deviation $\phi$ at information set $I \in \mathcal{I}_i$, given joint strategy $\pi$, is

$$\rho_I(\phi; \pi) = \overbrace{P(h(I); \phi^{\prec I}(\pi_i))}^{\text{Prob. } \phi(\pi_i) \text{ plays to } I.}\ \underbrace{v_I(\phi^{\succeq I}(\pi_i); \pi_{-i})}_{\text{Full deviation value.}} - \overbrace{P(h(I); \phi^{\prec I}(\pi_i))}^{\text{Prob. } \phi(\pi_i) \text{ plays to } I.}\ \underbrace{v_I(\pi_i; \pi_{-i})}_{\text{Value of immediate re-correlation.}} \tag{37}$$

Thus, full $\Phi$-regret is regret under a weighted counterfactual value function that we could perhaps call a $\Phi$-counterfactual value function,

$$v^{\phi^{\prec I}}_I : \pi_i; \pi_{-i} \mapsto P(h(I); \phi^{\prec I}(\pi_i))\, v_I(\pi_i; \pi_{-i}), \tag{38}$$

i.e.,

$$\rho_I(\phi; \pi) = v^{\phi^{\prec I}}_I(\phi^{\succeq I}(\pi_i); \pi_{-i}) - v^{\phi^{\prec I}}_I(\pi_i; \pi_{-i}). \tag{39}$$

Immediate $\Phi$-regret can also be written as regret under a $\Phi$-counterfactual value function:

$$\rho_I(\phi^{\preceq I}; \pi) = P(h(I); \phi^{\prec I}(\pi_i))\, \rho^{\mathrm{CF}}_I(\phi_I; \pi) \tag{40}$$
$$= P(h(I); \phi^{\prec I}(\pi_i))\left( v_I(\phi_I(\pi_i); \pi_{-i}) - v_I(\pi_i; \pi_{-i}) \right) \tag{41}$$
$$= P(h(I); \phi^{\prec I}(\pi_i))\, v_I(\phi_I(\pi_i); \pi_{-i}) - P(h(I); \phi^{\prec I}(\pi_i))\, v_I(\pi_i; \pi_{-i}) \tag{42}$$
$$= v^{\phi^{\prec I}}_I(\phi_I(\pi_i); \pi_{-i}) - v^{\phi^{\prec I}}_I(\pi_i; \pi_{-i}). \tag{43}$$

We can now state our decomposition result:

Lemma 1.
Let $\Phi = \{\Phi_I \subseteq \Phi^{\mathrm{SW}}_{\mathcal{A}(I)}\}_{I\in\mathcal{I}_i} \subseteq \Phi^{\mathrm{SW}}_{\mathcal{I}_i}$ be a subset of behavioral deviations. Then the full $\Phi$-regret at information set $I$ of deviation $\phi \in \Phi$ decomposes into the immediate $\Phi$-regret at $I$ plus the full $\Phi$-regrets at each of $I$'s children, i.e.,

$$\rho^T_I(\phi) = \rho^T_I(\phi^{\preceq I}) + \sum_{I' \in \bigcup_{a\in\mathcal{A}(I)} \mathcal{I}_i(I, a)} \rho^T_{I'}(\phi). \tag{44}$$
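For intuition about how this recurrence will be used, unrolling Eq. (44) once at an information set $I$ with exactly two children $I_1$ and $I_2$ (a hypothetical tree fragment, for illustration) gives

$$\rho^T_I(\phi) = \rho^T_I(\phi^{\preceq I}) + \rho^T_{I_1}(\phi) + \rho^T_{I_2}(\phi) = \rho^T_I(\phi^{\preceq I}) + \sum_{k=1}^{2}\Bigl( \rho^T_{I_k}(\phi^{\preceq I_k}) + \sum_{I' \in \bigcup_{a\in\mathcal{A}(I_k)} \mathcal{I}_i(I_k, a)} \rho^T_{I'}(\phi) \Bigr),$$

so continuing to the leaves expresses $\rho^T_I(\phi)$ as a sum of at most $|\mathcal{I}_i|$ immediate $\Phi$-regrets, one per information set at or below $I$. This is the unrolling used in the proof of Theorem 1 below.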
Proof. The steps of this proof largely mirror those used in Zinkevich et al. (2007a)'s Lemma 5. By definition,

$$\rho^T_I(\phi) = \sum_{t=1}^{T} v^{\phi^{\prec I}}_I(\phi^{\succeq I}(\pi^t_i); \pi^t_{-i}) - v^{\phi^{\prec I}}_I(\pi^t_i; \pi^t_{-i}). \tag{45}$$

Adding and subtracting $v^{\phi^{\prec I}}_I(\phi_I(\pi^t_i); \pi^t_{-i})$,

$$= \underbrace{\sum_{t=1}^{T} v^{\phi^{\prec I}}_I(\phi_I(\pi^t_i); \pi^t_{-i}) - v^{\phi^{\prec I}}_I(\pi^t_i; \pi^t_{-i})}_{\text{Immediate } \Phi\text{-regret, } \rho_I(\phi^{\preceq I};\, \pi^t).} \tag{46}$$
$$+ \underbrace{\sum_{t=1}^{T} v^{\phi^{\prec I}}_I(\phi^{\succeq I}(\pi^t_i); \pi^t_{-i}) - v^{\phi^{\prec I}}_I(\phi_I(\pi^t_i); \pi^t_{-i})}_{\substack{\text{Regret for re-correlating after } I. \\ \text{Both terms apply transformation } \phi_I.}}. \tag{47}$$

Expanding the definition of $v^{\phi^{\prec I}}_I$ in the difference of the second sum,

$$v^{\phi^{\prec I}}_I(\phi^{\succeq I}(\pi^t_i); \pi^t_{-i}) - v^{\phi^{\prec I}}_I(\phi_I(\pi^t_i); \pi^t_{-i}) \tag{48}$$
$$= P(h(I); \phi^{\prec I}(\pi^t_i)) \sum_{a\in\mathcal{A}(I)} [\phi_I \pi^t_i](a \mid I)\left( \overbrace{r(I, a; \pi^t_{-i}) - r(I, a; \pi^t_{-i})}^{=\,0} + \sum_{I'\in\mathcal{I}_i(I,a)} v_{I'}(\phi^{\succ I}(\pi^t_i); \pi^t_{-i}) - v_{I'}(\pi^t_i; \pi^t_{-i}) \right),$$

where the immediate-value terms cancel. If $p(I') = I$ and $\mathbb{a}(I') = a$, then $P(h(I'); \phi^{\prec I'}(\pi^t_i)) = P(h(I); \phi^{\prec I}(\pi^t_i))\, [\phi_I \pi^t_i](a \mid I)$. Therefore,

$$= \sum_{a\in\mathcal{A}(I)} \sum_{I'\in\mathcal{I}_i(I,a)} \underbrace{P(h(I); \phi^{\prec I}(\pi^t_i))\, [\phi_I \pi^t_i](a \mid I)}_{P(h(I');\, \phi^{\prec I'}(\pi^t_i))} \left( v_{I'}(\phi^{\succ I}(\pi^t_i); \pi^t_{-i}) - v_{I'}(\pi^t_i; \pi^t_{-i}) \right) \tag{49}$$
$$= \sum_{I'\in\bigcup_{a\in\mathcal{A}(I)}\mathcal{I}_i(I,a)} P(h(I'); \phi^{\prec I'}(\pi^t_i)) \left( v_{I'}(\phi^{\succ I}(\pi^t_i); \pi^t_{-i}) - v_{I'}(\pi^t_i; \pi^t_{-i}) \right) \tag{50}$$
$$= \sum_{I'\in\bigcup_{a\in\mathcal{A}(I)}\mathcal{I}_i(I,a)} v^{\phi^{\prec I'}}_{I'}(\phi^{\succeq I'}(\pi^t_i); \pi^t_{-i}) - v^{\phi^{\prec I'}}_{I'}(\pi^t_i; \pi^t_{-i}) \tag{51}$$
$$= \sum_{I'\in\bigcup_{a\in\mathcal{A}(I)}\mathcal{I}_i(I,a)} \rho_{I'}(\phi; \pi^t), \tag{52}$$

where the last two lines follow from the definitions of $v^{\phi^{\prec I'}}_{I'}$ and $\rho_{I'}$.

Finally, we can substitute Eq. (52) back into Eq. (47) to arrive at the desired decomposition:

$$\rho^T_I(\phi) = \sum_{t=1}^{T} \rho_I(\phi^{\preceq I}; \pi^t) + \sum_{I'\in\bigcup_{a\in\mathcal{A}(I)}\mathcal{I}_i(I,a)} \sum_{t=1}^{T} \rho_{I'}(\phi; \pi^t) \tag{53}$$
$$= \rho^T_I(\phi^{\preceq I}) + \sum_{I'\in\bigcup_{a\in\mathcal{A}(I)}\mathcal{I}_i(I,a)} \rho^T_{I'}(\phi). \tag{54}$$

Using Lemma 1 and instantiating EFR with exact regret matching, we can derive a simple regret bound that depends on the number of immediate $\Phi$-regrets associated with a given subset of behavioral deviations.

Theorem 1.
Instantiate EFR for player $i$ with exact regret matching and deviations $\Phi$ defined by valid transformation sequences and action transformations, $\{\Phi^{\preceq I} \subseteq \Phi^{\mathrm{SW}}_{\preceq I},\ \Phi_I \subseteq \Phi^{\mathrm{SW}}_{\mathcal{A}(I)}\}_{I\in\mathcal{I}_i}$. For each information set $I$ after the start of the game, let $C_I = \max_{\phi_I\in\Phi_I} \sum_{\phi^{\prec I}\in\Phi^{\preceq p(I)}} \mathbb{1}\{\phi^{\prec I}\phi_I \in \Phi^{\preceq I}\}$ be the maximum number of transformation sequences that are valid for any immediate transformation there, and let $C_I = 1$ otherwise. Let $D = \max_I C_I\,\omega(\Phi_I)$. Then, EFR's cumulative regret after $T$ rounds with respect to $\Phi$ is upper bounded by $2U|\mathcal{I}_i|\sqrt{DT}$.

Proof. EFR keeps track of each immediate $\Phi$-regret for each deviation at each information set $I$. The immediate strategies at each $I$ on each round are chosen according to regret matching on the cumulative immediate $\Phi$-regrets there. Therefore, the cumulative immediate $\Phi$-regret at each information set is bounded as $\rho^T_I(\phi^{\preceq I}) \le 2U\sqrt{DT}$ according to Corollary 1. Using this bound and unrolling the decomposition relationship of Lemma 1 from the beginning of the game to the end, we see that the full $\Phi$-regret at each information set is bounded as $\rho^T_I(\phi) \le 2U|\mathcal{I}_i|\sqrt{DT}$. EFR's cumulative regret with respect to $\Phi$ is equal to its cumulative full $\Phi$-regret at the start of the game, so the former is bounded by $2U|\mathcal{I}_i|\sqrt{DT}$ as well, which concludes the argument.

See Table C.1 for EFR instantiations with each partial sequence deviation type.

The variable $D$ in the EFR regret bound, which depends on the particular behavioral deviation subset with which EFR is instantiated, is often the number of immediate $\Phi$-regrets generated by that subset divided by the number of information sets. $D$ is slightly larger for CSPS because CSPS uses the union of internal and external transformations for its action transformation set, $\Phi_I$, at all information sets except those at the beginning of the game. Since $C_I$ counts the maximum number of valid incoming transformation sequences for any action transformation and $\omega(\Phi_I)$ counts the maximum number of probability-mass movements done by $\Phi_I$ for any action, their product ends up being larger for CSPS than the number of valid combinations between incoming transformation sequences and $I$'s action transformations. See Table C.1 for the $D$ values corresponding to each partial sequence deviation type.
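As a concrete special case: when every $\Phi_I$ is the set of external transformations and the only valid incoming sequence is the all-identity sequence, EFR's per-information-set bookkeeping reduces to CFR's, i.e., regret matching over cumulative immediate counterfactual regrets. A minimal sketch of that special case follows (our own simplification with names of our choosing; general EFR roughly maintains one regret per valid transformation sequence, with the sequences' reach probabilities acting as time selection weights):

```python
import numpy as np

class ExternalDeviationNode:
    """Per-information-set state for EFR restricted to external action
    transformations -- equivalently, vanilla CFR's update."""

    def __init__(self, num_actions):
        self.cumulative_regret = np.zeros(num_actions)

    def strategy(self):
        """Regret matching over cumulative immediate regrets."""
        y = np.clip(self.cumulative_regret, 0.0, None)
        z = y.sum()
        return y / z if z > 0.0 else np.full(len(y), 1.0 / len(y))

    def observe(self, action_values, reach_weight):
        """Accumulate immediate counterfactual regret at this node.

        action_values[a] plays the role of v_I(phi^a_I(pi_i); pi_{-i});
        reach_weight plays the role of the deviation's probability of
        reaching I, P(h(I); phi^{<I}(pi_i)).
        """
        baseline = self.strategy() @ action_values
        self.cumulative_regret += reach_weight * (action_values - baseline)
```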
D Regret Matching++

Kash, Sullins, and Hofmann (2020) present the regret matching++ algorithm and claim that it is no-external-regret. This algorithm's proposed regret bound implies a sublinear bound on cumulative positive regret, which would further imply that it has the same bound with respect to all possible time selection functions. The surprising aspect of this result is that the algorithm requires neither information about the possible time selection functions nor more computation or storage than basic regret matching. The following result, Theorem 3, shows that there is actually no algorithm that can achieve a sublinear bound on cumulative positive regret. This result proves that regret matching++ cannot be no-external-regret as claimed. Appendix D.2 identifies the mistake in the regret matching++ bound proof.
[Table C.1 (floated near here in the original layout): Partial sequence deviation definitions and EFR parameters ($\{\Phi^{\preceq I}, \Phi_I\}_{I\in\mathcal{I}_i}$), along with the corresponding $D$ parameter in EFR's regret bound. The superscript in $\phi^a_I$ denotes the external transformation to action $a$ and, similarly, $\phi^{a\to a'}_I$ is the internal transformation from $a$ to $a'$. $\phi^{\prec I'}$ denotes a sequence of identity transformations at information sets before information set $I'$, and $\phi^{\to I'}_{\succeq I}$ denotes the sequence of external transformations that play from $I$ to $I'$, where the information sets and actions between the start of the game and $I$ are enumerated by depth.]

D.1 Linear Lower Bound on the Sum of Positive Regrets

Theorem 3. The worst-case maximum cumulative positive regret,

$$Q^T = \max_{a\in\mathcal{A}} \sum_{t=1}^{T} \left( r^t(a) - \langle \pi^t, r^t \rangle \right)^+,$$

under a sequence of reward functions chosen from the class of bounded reward functions, $(r^t \in \{r : r \in \mathbb{R}^{|\mathcal{A}|},\ \|r\|_\infty \le 1\})_{t=1}^T$, of any algorithm that chooses policies $\pi^t \in \Delta^{|\mathcal{A}|}$ over a finite set of actions, $\mathcal{A}$, in an online fashion over $T$ rounds, is at least $T/4$.

Proof. Without loss of generality, consider a two-action environment, $\mathcal{A} = (a, a')$, and any learning algorithm that deterministically chooses a distribution, $\pi^t \in \Delta$, over them on each round $t$. The environment gets to see the learner's policy before presenting a reward function. If the learner weights one action more than the other, the environment gives a reward of zero to the action with the larger weight and a reward of one to the action with the smaller weight. Formally, if $\pi^t(a) \ge 0.5$, then $r^t(a) = 0$ and $r^t(a') = 1$, and vice versa otherwise. Under this construction, the learner's expected reward is $\langle \pi^t, r^t \rangle \le 0.5$ on every round, so the rewarded action accrues an instantaneous positive regret of at least $0.5$. The positive regret accumulated across both actions is therefore at least $T/2$, and the action with the larger total accumulates at least $T/4$, so $Q^T \ge T/4$.
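The environment in this proof is easy to simulate. The following sketch (our own illustration) plays it against a learner that acts proportionally to its accumulated positive parts of instantaneous regrets, in the spirit of regret matching++'s thresholded accumulation; $Q^T/T$ stays bounded away from zero, as Theorem 3 guarantees for any learner:

```python
import numpy as np

T = 10_000
q = np.zeros(2)  # per-action sums of positive instantaneous regrets

for t in range(T):
    # Learner: any online policy works here; this one acts proportionally
    # to the accumulated positive parts of instantaneous regrets.
    z = q.sum()
    pi = q / z if z > 0.0 else np.full(2, 0.5)

    # Adversary: reward 1 to the action with the smaller weight and 0 to
    # the action with the larger weight (ties broken arbitrarily).
    r = np.zeros(2)
    r[int(np.argmin(pi))] = 1.0

    ev = pi @ r  # learner's expected reward, at most 1/2
    q += np.clip(r - ev, 0.0, None)

print(q.max() / T)  # Q^T / T; at least 1/4 by the pigeonhole argument
```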
E Experiments

See Table E.2 for EFR instantiations with the non-partial-sequence deviation types used in the experiments (i.e., behavioral, informed action, informed counterfactual, and blind counterfactual).
E.1 Games

Leduc Hold'em Poker
Leduc hold'em poker (Southey et al. 2005) is a two-player poker game played with a six-card deck (two suits and three ranks). At the start of the game, both players ante one chip and receive one private card. There are two betting rounds, with a maximum of two raises per round; bet sizes are limited to two chips in the first round and four in the second. If one player folds, the other wins. At the start of the second round, a public card is revealed. A showdown occurs at the end of the second round if no player folds. The strongest hand in a showdown is a pair (using the public card); if neither player pairs, the players compare the ranks of their private cards. The player with the stronger hand takes all chips in the pot, or the players split the pot if their hands have the same strength. Payoffs are reported in milli-big blinds (mbb), where the ante is considered a big blind, for consistency with the way performance is reported in other poker games.
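All of the benchmark games are available in OpenSpiel (Lanctot et al. 2019). A minimal sketch of loading and rolling out Leduc: the short name "leduc_poker" is OpenSpiel's, while the goofspiel parameter spellings in the comment are our best recollection and worth checking against the library's documentation.

```python
import numpy as np
import pyspiel

game = pyspiel.load_game("leduc_poker")  # two players by default
# Goofspiel variants are parameterized similarly, e.g. (assumed names):
# pyspiel.load_game("goofspiel",
#                   {"imp_info": True, "num_cards": 5,
#                    "points_order": "descending", "players": 2})

rng = np.random.default_rng(0)
state = game.new_initial_state()
while not state.is_terminal():
    if state.is_chance_node():
        actions, probs = zip(*state.chance_outcomes())
        state.apply_action(int(rng.choice(actions, p=probs)))
    else:
        state.apply_action(int(rng.choice(state.legal_actions())))

print(state.returns())  # chip payoffs; with the one-chip ante treated
                        # as the big blind, multiply by 1000 for mbb
```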
Imperfect Information Goofspiel
Imperfect information goofspiel (Ross 1971; Lanctot 2013) is a bidding game for $N$ players. Each player is given a hand of $n$ ranks that they play to bid on $n$ point cards. On each round, one point card is revealed and each player simultaneously bids on it. The point cards may be sorted in ascending order ($\uparrow$), sorted in descending order ($\downarrow$), or shuffled ($R$). If there is one bid that is greater than all the others, the player who made that bid wins the point card. If there is a draw, the point card is instead discarded. The player with the most points wins, so payoffs are reported in win percentage. We use five goofspiel variants:

• two-player, 5-ranks, ascending (goofspiel$(5, \uparrow, N = 2)$),
• two-player, 5-ranks, descending (goofspiel$(5, \downarrow, N = 2)$),
• two-player, 4-ranks, random (goofspiel$(4, R, N = 2)$),
• three-player, 4-ranks, ascending (goofspiel$(4, \uparrow, N = 3)$), and
• three-player, 4-ranks, descending (goofspiel$(4, \downarrow, N = 3)$).

[Table E.2 (floated near here in the original layout): Behavioral, informed action, informed counterfactual, and blind counterfactual deviation definitions (where behavioral deviations are defined with internal transformations) and EFR parameters ($\{\Phi^{\preceq I}, \Phi_I\}_{I\in\mathcal{I}_i}$), along with the corresponding $D$ parameter in EFR's regret bound, using the same notation as Table C.1.]

Sheriff

Sheriff is a two-player, non-zero-sum negotiation game resembling the Sheriff of Nottingham board game; it was introduced by Farina et al. (2019b). At the beginning of the game, the "smuggler" player chooses zero or more illegal items (up to a maximum of three) to add to their cargo. The rest of the game proceeds over four rounds. At the beginning of each round, the smuggler signals how much they would be willing to pay the "sheriff" player, between zero and three, to bribe them into not inspecting the smuggler's cargo. The sheriff responds by signalling whether or not they would inspect the cargo. On the last round, the bribe amount chosen by the smuggler and the sheriff's decision about whether or not to inspect the cargo are binding. If the cargo is not inspected, then the smuggler receives a payoff equal to the number of illegal items it contains, minus the bribe amount, and the sheriff receives the bribe amount. Otherwise, the sheriff inspects the cargo. If the sheriff finds an illegal item, the sheriff forces the smuggler to pay them two times the number of illegal items; otherwise, the sheriff compensates the smuggler by paying them three.

Tiny Bridge

A miniature version of bridge created by Edward Lockhart, inspired by a research project at the University of Alberta by Michael Bowling, Kate Davison, and Nathan Sturtevant. We use the smaller two-player version rather than the full four-player version. See the implementation from Lanctot et al. (2019) for more details.

Tiny Hanabi

A miniature two-player version of Hanabi described by Foerster et al. (2019). The game is fully cooperative and the optimal score is ten.
Both players take only one action, so all EFR instances collapse except when they differ in their choice of $\Phi_I$.

E.2 Alternative $\Phi_I$ Choices

When implementing EFR for deviations that set the action transformations at each information set to the internal transformations, we have the option of using the union of the internal and external transformations instead, without substantially changing the variant's theoretical properties. We test how this choice impacts practical performance within the EFR variants for informed counterfactual deviations, CFPS deviations, and TIPS deviations. These variants carry an "EX+IN" subscript.

E.3 Results

We present three sets of figures to summarize the performance of each EFR variant in the fixed and simultaneous regimes described in Section 7. Figs. E.1 and E.3 show the running average expected payoff of each variant over time, averaged over play with all EFR variants (including itself). These figures summarize the progress that each variant makes over time to adapt to and correlate with its companion variant, on average. Figs. E.2 and E.4 show the instantaneous expected payoff of each variant over time, averaged over play with all EFR variants. These figures illustrate how each variant performs on average in each round individually. Figs. E.5 and E.6 show the average expected payoff of each variant paired with each other variant (including itself) after 500 rounds. These figures illustrate how well each variant works with each other variant.

F On Informed Causal Deviations and Trigger Actions

An informed causal deviation deploys an alternative strategy from a trigger information set to the end of the game only if the input strategy plays a trigger action in the trigger information set (Dudík and Gordon 2009). Because information about the strategy at the trigger information set is used to determine whether or not the deviation transforms immediate strategies after this information set, informed causal deviations are not behavioral deviations. This structure derives from the reduced-form strategy assumption mentioned in Section 3.

Assuming pure reduced-form strategies, transforming $a^!$ to $a'$ in information set $I^!$ ensures that if the input strategy plays $a^!$, then no actions will be assigned to any of the information sets following $a'$. Because the strategy is undefined after this transformation, we can assume that these actions are set to fixed choices that maximize the value of the deviation. But if the strategy being transformed actually takes action $a'$, then there must be a strategy after $a'$ to follow. So, technically, the deviation function we have just described consists only of an internal transformation at $I^!$.

What happens when we no longer assume that strategies are in reduced form? In order to accurately reproduce the behavior of the deviations we just described, an informed causal deviation must transform the immediate strategies at information sets after $I^!$ according to the weight that the input strategy puts on action $a^!$, but not apply these transformations according to the weight the input strategy puts on $a'$ itself. This leads to the conventional definition of informed causal deviations from Dudík and Gordon (2009).

What implication does this have for CSPS and TIPS deviations? Essentially none, precisely because they allow re-correlation. A CSPS or TIPS deviation is a behavioral deviation and, therefore, must apply each action transformation it assigns to each information set.
If the input strategy chooses $a'$, a CSPS or TIPS deviation will apply all transformations at subsequent information sets, because there is no mechanism for transmitting the information that $a^!$ was not played to the action transformations assigned to successors. However, because there are CSPS and TIPS deviations that re-correlate immediately after an internal transformation at any trigger information set, the full set of CSPS and TIPS deviations still captures all of the power of informed causal deviations.

[Figure E.1: The expected payoff accumulated by each EFR variant over time averaged over play with all EFR variants in each game in the fixed regime.]

[Figure E.2: The instantaneous payoff achieved by each EFR variant on each round averaged over play with all EFR variants in each game in the fixed regime.]
[Figure E.3: The expected payoff accumulated by each EFR variant over time averaged over play with all EFR variants in each game in the simultaneous regime.]

[Figure E.4: The instantaneous payoff achieved by each EFR variant on each round averaged over play with all EFR variants in each game in the simultaneous regime.]
[Figure E.5: (1/2) The average expected payoff accumulated by each EFR variant (listed by row) from playing with each other EFR variant (listed by column) in each game after 500 rounds.]

[Figure E.6: (2/2) The average expected payoff accumulated by each EFR variant (listed by row) from playing with each other EFR variant (listed by column) in each game after 500 rounds.]

References

Aumann, R. J. 1974. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics 1(1): 67–96.

Blum, A.; and Mansour, Y. 2007. From external to internal regret. Journal of Machine Learning Research 8: 1307–1324.

Brown, N.; Lerer, A.; Gross, S.; and Sandholm, T. 2019. Deep Counterfactual Regret Minimization. In Proceedings of the 36th International Conference on Machine Learning (ICML-19), 793–802.

Burch, N. 2017. Time and Space: Why Imperfect Information Games are Hard. Ph.D. thesis, University of Alberta.

Burch, N.; Lanctot, M.; Szafron, D.; and Gibson, R. 2012. Efficient Monte Carlo counterfactual regret minimization in games with many player actions. In Advances in Neural Information Processing Systems, 1880–1888.

Celli, A.; Marchesi, A.; Farina, G.; and Gatti, N. 2020. No-regret learning dynamics for extensive-form correlated equilibrium. In Advances in Neural Information Processing Systems.

Davis, T.; Schmid, M.; and Bowling, M. 2020. Low-Variance and Zero-Variance Baselines for Extensive-Form Games. In International Conference on Machine Learning, 2392–2401. PMLR.

D'Orazio, R. 2020. Regret Minimization with Function Approximation in Extensive-Form Games. Master's thesis, University of Alberta.

D'Orazio, R.; and Huang, R. 2021. Optimistic and Adaptive Lagrangian Hedging. In Reinforcement Learning in Games Workshop at the Thirty-Fifth AAAI Conference on Artificial Intelligence.

D'Orazio, R.; Morrill, D.; Wright, J. R.; and Bowling, M. 2020. Alternative Function Approximation Parameterizations for Solving Games: An Analysis of f-Regression Counterfactual Regret Minimization. In Proceedings of the Nineteenth International Conference on Autonomous Agents and Multi-Agent Systems. International Foundation for Autonomous Agents and Multiagent Systems.

Dudík, M.; and Gordon, G. J. 2009. A Sampling-Based Approach to Computing Equilibria in Succinct Extensive-Form Games. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI-2009).

Farina, G.; Kroer, C.; Brown, N.; and Sandholm, T. 2019a. Stable-Predictive Optimistic Counterfactual Regret Minimization. In International Conference on Machine Learning, 1853–1862.

Farina, G.; Kroer, C.; and Sandholm, T. 2020. Faster Game Solving via Predictive Blackwell Approachability: Connecting Regret Matching and Mirror Descent.
arXiv preprint arXiv:2007.14358.

Farina, G.; Ling, C. K.; Fang, F.; and Sandholm, T. 2019b. Correlation in Extensive-Form Games: Saddle-Point Formulation and Benchmarks. In Conference on Neural Information Processing Systems (NeurIPS).

Foerster, J.; Song, F.; Hughes, E.; Burch, N.; Dunning, I.; Whiteson, S.; Botvinick, M.; and Bowling, M. 2019. Bayesian action decoder for deep multi-agent reinforcement learning. In International Conference on Machine Learning, 1942–1951. PMLR.

Forges, F.; and von Stengel, B. 2002. Computationally Efficient Coordination in Game Trees. THEMA Working Papers 2002-05, THEMA (Théorie Économique, Modélisation et Applications), Université de Cergy-Pontoise.

Foster, D. P.; and Vohra, R. 1999. Regret in the on-line decision problem. Games and Economic Behavior 29(1–2): 7–35.

Gibson, R. 2014. Regret Minimization in Games and the Development of Champion Multiplayer Computer Poker-Playing Agents. Ph.D. thesis, University of Alberta.

Gibson, R.; Lanctot, M.; Burch, N.; Szafron, D.; and Bowling, M. 2012. Generalized Sampling and Variance in Counterfactual Regret Minimization. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (AAAI-12), 1355–1361.

Gordon, G. J. 2005. No-regret algorithms for structured prediction problems. Technical report, Carnegie Mellon University, School of Computer Science.

Greenwald, A.; Jafari, A.; and Marks, C. 2003. A general class of no-regret learning algorithms and game-theoretic equilibria. In Proceedings of the 2003 Computational Learning Theory Conference, 1–11.

Greenwald, A.; Li, Z.; and Marks, C. 2006. Bounds for Regret-Matching Algorithms. In ISAIM.

Hart, S.; and Mas-Colell, A. 2000. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica 68(5): 1127–1150.

Johanson, M.; Bard, N.; Lanctot, M.; Gibson, R.; and Bowling, M. 2012. Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization. In Proceedings of the Eleventh International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS).

Kash, I. A.; Sullins, M.; and Hofmann, K. 2020. Combining no-regret and Q-learning. In Proceedings of the Nineteenth International Conference on Autonomous Agents and Multi-Agent Systems. International Foundation for Autonomous Agents and Multiagent Systems.

Kuhn, H. W. 1953. Extensive Games and the Problem of Information. Contributions to the Theory of Games 2: 193–216.

Lanctot, M. 2013. Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games. Ph.D. thesis, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada.

Lanctot, M.; Lockhart, E.; Lespiau, J.-B.; Zambaldi, V.; Upadhyay, S.; Pérolat, J.; Srinivasan, S.; Timbers, F.; Tuyls, K.; Omidshafiei, S.; Hennes, D.; Morrill, D.; Muller, P.; Ewalds, T.; Faulkner, R.; Kramár, J.; Vylder, B. D.; Saeta, B.; Bradbury, J.; Ding, D.; Borgeaud, S.; Lai, M.; Schrittwieser, J.; Anthony, T.; Hughes, E.; Danihelka, I.; and Ryan-Davis, J. 2019. OpenSpiel: A Framework for Reinforcement Learning in Games. CoRR abs/1908.09453. URL http://arxiv.org/abs/1908.09453.

Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. Monte Carlo Sampling for Regret Minimization in Extensive Games. In Bengio, Y.; Schuurmans, D.; Lafferty, J.; Williams, C. K. I.; and Culotta, A., eds., Advances in Neural Information Processing Systems 22, 1078–1086.

Morrill, D. 2016. Using Regret Estimation to Solve Games Compactly. Master's thesis, University of Alberta.

Morrill, D.; D'Orazio, R.; Sarfati, R.; Lanctot, M.; Wright, J. R.; Greenwald, A.; and Bowling, M. 2020. Hindsight and Sequential Rationality of Correlated Play.

Rakhlin, S.; and Sridharan, K. 2013. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, 3066–3074.
Ross, S. M. 1971. Goofspiel — the game of pure strategy. Journal of Applied Probability 8(3): 621–625.

Schmid, M.; Burch, N.; Lanctot, M.; Moravčík, M.; Kadlec, R.; and Bowling, M. 2019. Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games Using Baselines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2157–2164.

Southey, F.; Bowling, M. H.; Larson, B.; Piccione, C.; Burch, N.; Billings, D.; and Rayner, D. C. 2005. Bayes' Bluff: Opponent Modelling in Poker. In UAI '05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26–29, 2005, 550–558. URL https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=1216&proceeding_id=21.

Steinberger, E.; Lerer, A.; and Brown, N. 2020. DREAM: Deep regret minimization with advantage baselines and model-free learning. arXiv preprint arXiv:2006.10410.

Syrgkanis, V.; Agarwal, A.; Luo, H.; and Schapire, R. E. 2015. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, 2989–2997.

Tammelin, O. 2014. Solving Large Imperfect Information Games Using CFR+. arXiv preprint arXiv:1407.5042.

Tammelin, O.; Burch, N.; Johanson, M.; and Bowling, M. 2015. Solving Heads-up Limit Texas Hold'em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence.

von Stengel, B.; and Forges, F. 2008. Extensive-form correlated equilibrium: Definition and computational complexity. Mathematics of Operations Research 33(4): 1002–1022.

Waugh, K.; and Bagnell, J. A. 2015. A Unified View of Large-Scale Zero-Sum Equilibrium Computation. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.

Waugh, K.; Morrill, D.; Bagnell, J. A.; and Bowling, M. 2015. Solving Games with Functional Regret Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence.

Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2007a. Regret Minimization in Games with Incomplete Information. Technical Report TR07-14, University of Alberta.

Zinkevich, M.; Johanson, M.; Bowling, M. H.; and Piccione, C. 2007b. Regret Minimization in Games with Incomplete Information. In Advances in Neural Information Processing Systems 20.