Variance decompositions for extensive-form games
Alex Cloud, Eric Laber
North Carolina State University
September 8, 2020
Abstract
Quantitative measures of randomness in games are useful for game design and have implications for gambling law. We treat the outcome of a game as a random variable and derive a closed-form expression and estimator for the variance in the outcome attributable to a player of the game. We analyze poker hands to show that randomness in the cards dealt has little influence on the outcomes of each hand. A simple example is given to demonstrate how variance decompositions can be used to measure other interesting properties of games.
From game design studios to courtrooms, randomness in games has been the subject of extensive discussion. Game designers use random game elements to protect players' egos, increase gameplay variety, and limit the efficacy of mental calculation [11]. In U.S. state law, the question of whether poker is predominantly a game of chance or skill is considered central to the legality of online poker [3, 14].

The question of how to measure the role of luck versus skill has proved difficult and produced many answers [5, 4, 11, 14, 6, 7, 18]. For example, in
USA v. Lawrence Dicristina, economic consultant and high-level amateur poker player Randal Heeb testified that "statistical analysis of poker hands confirms that skill predominates over chance." His conclusion was based on a series of heuristic data analyses combined with intuitive judgments [12]. Others have argued that the strong association between player skill rating and future earnings constitutes strong evidence that poker should be considered a game of skill [14, 16].

A first step in assessing the role of chance in a game is to quantify sources of uncertainty. We examine how variation in the outcomes of a game can be attributed to players or chance events using a variance decomposition, a standard statistical method in which the variance of a random variable is written as the sum of nonnegative terms corresponding to variation attributable to different factors [1]. We express the total variation in game outcomes as the sum of variance components associated with (i) the actions taken by a player of interest, and (ii) all remaining sources of variation. By applying this decomposition to a conceptual "chance player," we measure the degree to which randomness inherent in a game biases the results in favor of a given player. We derive an analytical expression for the variance components and use it to obtain estimators which are model-free in the sense that they do not require access to an entire game model or other players' behavior. Our results apply to finite extensive-form games in general; they are not limited to the two-player, zero-sum case. As an illustrative example, we analyze poker hands played by the DeepStack poker AI against professional players [17] and find that chance events have very little influence on the expected per-hand profit for a player relative to the total variation in per-hand profit.

Figure 1: An example extensive-form game. Each node is associated with a state $s \in \mathcal{S}$ and is annotated with the corresponding player, $P(s)$.
The dashed line represents Player 2's information state; in this example, they cannot tell what move Player 1 played. Rewards for Player 1 are shown below the terminal nodes.

An extensive-form game is a tree-based representation of a multi-agent system; Figure 1 displays a simple example. In this representation, the game is played by traversing the tree from the root to a leaf node, with a player's action at each node determining the next node visited. Our notation is based on [9] and [15], with some modifications.

Let $\mathcal{S}$ denote the set of possible game states, which we assume is finite; each state is associated with a node in the game tree. Define $\mathcal{N} = \{1, \dots, n\}$ to be the set of (non-chance) players and let $c$ denote the chance player. The player function $P : \mathcal{S} \to \mathcal{N} \cup \{c\}$ associates each state with a player. At each state $s \in \mathcal{S}$, there is a finite set of available actions $\mathcal{A}(s)$, such that each $a \in \mathcal{A}(s)$ uniquely determines the next state visited in the tree [8].

A sequence of actions $z = (a_1, \dots, a_m)$ is a terminal history if it leads from the root to a leaf of the game tree; let $\mathcal{Z}$ denote the set of all terminal histories. For each player $i \in \mathcal{N}$ and terminal history $z \in \mathcal{Z}$, a reward $r_i(z) \in \mathbb{R}$ is obtained by player $i$ upon reaching $z$. Each player $i \in \mathcal{N}$ has a set of information states $\mathcal{U}_i$ which represent collections of nodes that are indistinguishable to the player. In particular, $\mathcal{U}_i$ is a partition of $\{s \in \mathcal{S} : P(s) = i\}$ with the additional condition that $\mathcal{A}(s) = \mathcal{A}(s')$ if $s$ and $s'$ are in the same information state. So, we can write $\mathcal{A}(u)$ for $u \in \mathcal{U}_i$ unambiguously. Define $\mathcal{U}_c = \{\{s\} : P(s) = c\}$.
We consider games of perfect recall, so that for every player $i$, each $u \in \mathcal{U}_i$ can be uniquely identified with the sequence of information states and actions required to arrive there.

Finally, the behavior of each player $i \in \mathcal{N} \cup \{c\}$ is described by a policy $\pi_i$ (also known as a behavioral strategy), which is a function that maps each information state $u \in \mathcal{U}_i$ to a distribution over the allowable actions $\mathcal{A}(u)$. A policy profile is a tuple of player policies, $\pi = (\pi_1, \dots, \pi_n)$. By convention, the policy of the chance player $\pi_c$ is considered to be a fixed part of the extensive-form game itself and not a part of any policy profile.

For convenience, we introduce random variables that represent the actions selected by players in a single play of the game. For each $i \in \mathcal{N} \cup \{c\}$ and each $u \in \mathcal{U}_i$, let $A(u)$ be a random variable taking values in $\mathcal{A}(u)$ which represents the action player $i$ would take given information state $u$. This variable always realizes a value, even if $u$ is not reached in a particular play of the game. Note that for $u \neq u' \in \mathcal{U}_i$, it need not be the case that $A(u)$ is independent of $A(u')$. This way of specifying player behavior is quite general and can account for different models of player action selection. For example, a player may randomly precommit to a deterministic policy (this is known as a mixed strategy in the game theory literature), or select actions independently at random at each time step (a behavioral strategy) [2].

For each terminal history $z \in \mathcal{Z}$ and player $i \in \mathcal{N} \cup \{c\}$, let $m_i(z)$ be the number of actions selected by player $i$ along $z$, so that for each $j \in \{1, \dots, m_i(z)\}$, we can write $u^i_{z,j}$ and $a^i_{z,j}$ to denote the $j$th information state observed and action selected by player $i$ along terminal history $z$. Define $I^i_{z,j} = \mathbb{1}[A(u^i_{z,j}) = a^i_{z,j}]$ to be the Bernoulli random variable that indicates whether player $i$ selects $a^i_{z,j}$ at $u^i_{z,j}$.
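To make the notation concrete, the following minimal encoding (hypothetical, in the spirit of Figure 1) represents a two-player game in which Player 2 acts without observing Player 1's move, so both of Player 2's nodes share one information state. Under behavioral strategies, the probability that a terminal history $z$ is realized is the product of the action probabilities along it.

```python
from itertools import product

# Hypothetical matching-pennies-style game: Player 1 picks H or T, then
# Player 2 picks H or T without observing Player 1's move (one information
# state covering both of Player 2's nodes). Player 1's reward is +1 on a
# match and -1 otherwise.
ACTIONS = ["H", "T"]

def reward(a1, a2):
    return 1.0 if a1 == a2 else -1.0

# Behavioral strategies: one distribution per information state.
pi1 = {"root": {"H": 0.5, "T": 0.5}}      # Player 1's single state
pi2 = {"after_p1": {"H": 0.3, "T": 0.7}}  # Player 2's single info state

# Terminal histories z = (a1, a2); P(I_z = 1) is the product of the
# probabilities of the actions along z.
def history_prob(a1, a2):
    return pi1["root"][a1] * pi2["after_p1"][a2]

probs = {z: history_prob(*z) for z in product(ACTIONS, ACTIONS)}
assert abs(sum(probs.values()) - 1.0) < 1e-12  # a distribution over Z

# Expected outcome for Player 1: E(Y) = sum_z r(z) P(I_z = 1).
expected_y = sum(reward(*z) * p for z, p in probs.items())
print(round(expected_y, 6))
```

With the symmetric policy for Player 1 shown here, the expected outcome is 0 regardless of Player 2's policy.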
Finally, define $I^i_z = \prod_{j=1}^{m_i(z)} I^i_{z,j}$ to be the Bernoulli random variable that indicates whether player $i$ selects all actions along $z$. (If $m_i(z) = 0$, set $I^i_z \equiv 1$.) Then for each $z \in \mathcal{Z}$, $I_z = \prod_{i \in \mathcal{N} \cup \{c\}} I^i_z$ defines a Bernoulli random variable such that the success probability $P(I_z = 1)$ is the probability that terminal history $z$ is realized. Let $Z$ be a random terminal history variable such that $P(Z = z) = P(I_z = 1)$ for all $z$, representing a random play-through of the game. This allows us to cast the outcome of an extensive-form game as $\mathbf{Y} = [r_1(Z), \dots, r_n(Z)]$.

Write $Y = r(Z) = r_i(Z)$, the random reward for a particular player of interest $i \in \mathcal{N} \cup \{c\}$ upon a play of the game. Our goal is to express its variance, $V(Y) = E\{[Y - E(Y)]^2\}$, as a sum of nonnegative terms corresponding to meaningful properties of a game.

Let $i \in \mathcal{N} \cup \{c\}$ be a player of interest, and let $A^i = [A(u)]_{u \in \mathcal{U}_i}$ be the concatenation of all actions for player $i$. By the law of total variance, we can decompose the variance in game outcomes as

$$V(Y) = V[E(Y \mid A^i)] + E[V(Y \mid A^i)]. \quad (3.1)$$

The term $E(Y \mid A^i)$ is the average game outcome upon many traversals of the game tree when player $i$ commits ahead of time to playing the actions in $A^i$. For example, $E(Y \mid A^c)$ represents the average outcome for a group of poker players who play the same hand from a deck with a particular card order many times, or the average outcome for a pair of chess players who start with the same colors every game. Then $V[E(Y \mid A^c)]$ is the variation in this mean as the chance actions $A^c$ vary, and represents the variation in game outcomes due to chance events. The latter term of (3.1) has a similar interpretation as the variation in game outcomes not explained by actions selected by player $i$.

Let $i \in \mathcal{N} \cup \{c\}$ be a player of interest.
Suppose that player $i$ plays according to a behavioral strategy $\pi_i$, meaning that $A(u)$ is independent of $A(u')$ for all $u \neq u' \in \mathcal{U}_i$ and action probabilities are given by a policy such that $P(I^i_{z,k} = 1) = \pi_i(a^i_{z,k} \mid u^i_{z,k})$ for all $z \in \mathcal{Z}$ and $k \in \{1, \dots, m_i(z)\}$. No such assumption is required for the remaining players; we only require that their actions be independent of the actions of player $i$.

For $i \in \mathcal{N} \cup \{c\}$, define $\eta_i(z) = P(I^i_z = 1)$, $\eta_{-i}(z) = P(\prod_{i' \in \mathcal{N} \cup \{c\} \setminus \{i\}} I^{i'}_z = 1)$, and $\eta(z) = \eta_i(z)\,\eta_{-i}(z) = P(I_z = 1)$. For each information state $u \in \mathcal{U}_i$, define $\mathcal{Z}(u) = \{z \in \mathcal{Z} : u \text{ is visited in } z\}$ and, for each $a \in \mathcal{A}(u)$, define $\mathcal{Z}(u,a) = \{z \in \mathcal{Z} : u \text{ is visited in } z \text{ and action } a \text{ is selected at } u\}$. Define $q(u,a) = E[r(Z) \mid Z \in \mathcal{Z}(u,a)]$ to be the expected outcome given that player $i$ is at $u$ and takes action $a$; similarly, define $v(u) = E[r(Z) \mid Z \in \mathcal{Z}(u)]$. Analogously to the history-level quantities, write $\eta_i(u)$ for the probability that player $i$ selects the actions on the path to $u$, $\eta_{-i}(u)$ for the corresponding probability for the remaining players, and $\eta(u) = \eta_i(u)\,\eta_{-i}(u)$ for the probability that $u$ is visited.

Our main result is an expression for the variance in game outcomes explained by player $i$'s actions as a sum of weighted, squared action-value and value functions over all of player $i$'s information states:

$$V[E(Y \mid A^i)] = \sum_{u \in \mathcal{U}_i} \Big( \sum_{a \in \mathcal{A}(u)} [q(u,a)]^2\, \pi_i(a \mid u) - [v(u)]^2 \Big)\, \eta_{-i}(u)\, \eta(u). \quad (3.2)$$

A proof is provided in Appendix A. Computing this requires traversing the game tree a fixed number of times and hence is $O(|\mathcal{S}|)$. From this we obtain a formula for the other variance component by observing that $E[V(Y \mid A^i)] = V(Y) - V[E(Y \mid A^i)]$, where $V(Y)$ can be evaluated as $\sum_{z \in \mathcal{Z}} [r(z) - \sum_{z' \in \mathcal{Z}} r(z')\,\eta(z')]^2\, \eta(z)$.

Assuming $\eta_{-i}$, $q$, and $v$ are known, given an i.i.d. sequence of $\nu$ playthroughs of the game, each generating a sequence $U_k = (U_{k,1}, \dots, U_{k,l_k})$ of observed information states in $\mathcal{U}_i$, the following is a consistent estimator for $V[E(Y \mid A^i)]$, as proved in Appendix B:

$$\nu^{-1} \sum_{k=1}^{\nu} \sum_{l=1}^{l_k} \Big( \sum_{a \in \mathcal{A}(U_{k,l})} [q(U_{k,l},a)]^2\, \pi_i(a \mid U_{k,l}) - [v(U_{k,l})]^2 \Big)\, \eta_{-i}(U_{k,l}). \quad (3.3)$$

In practice, $q$ and $v$ can be estimated by supervised learning, and $\eta_{-i} = \eta/\eta_i$ can be estimated with $\hat\eta(u) = \nu^{-1} \sum_{k=1}^{\nu} \sum_{l=1}^{l_k} \mathbb{1}(U_{k,l} = u)$ and $\eta_i(u) = \pi_i(u)$ (assuming the analyst does not have access to opponent policies and observations). However, if there are many possible information states, i.e., $|\mathcal{U}_i|$ is large, $\hat\eta(u)$ will greatly overestimate the visit probability. An alternative is the more straightforward regression-based estimator, which works by fitting a model for the conditional mean of the game outcome given a player's actions, then computing the empirical variance of the conditional mean estimator. The procedure is:

1. Specify a parametric model $f_\theta$ that maps the collection of all actions for the player of interest to a real number, $f_\theta : \times_{u \in \mathcal{U}_i} \mathcal{A}(u) \to \mathbb{R}$.

2. For each observed game $k \in \{1, \dots, \nu\}$, record action-outcome pairs $(A^i_k, Y_k)$. For each $k$, if an information state for the player of interest, $u \in \mathcal{U}_i$, was not visited in game $k$, sample $A(u) \sim \pi_i(\cdot \mid u)$ and include the sampled action in $A^i_k$.

3. Fit the model on the action-outcome pair data to find a $\hat\theta$ that minimizes the mean squared error, $\nu^{-1} \sum_{k=1}^{\nu} [Y_k - f_{\hat\theta}(A^i_k)]^2$, so that $f_{\hat\theta}(\cdot)$ estimates $E(Y \mid A^i = \cdot)$.

4. Compute the empirical variance of $f_{\hat\theta}(A^i)$, which is $\nu^{-1} \sum_{k=1}^{\nu} [f_{\hat\theta}(A^i_k) - \nu^{-1} \sum_{h=1}^{\nu} f_{\hat\theta}(A^i_h)]^2$. This is our estimate of $V[E(Y \mid A^i)]$.

We analyze 150 thousand hands of heads-up no-limit poker played by different players against the poker AI DeepStack, including 45 thousand hands played by self-identified professional players.
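As an illustrative sketch (a toy game, not the DeepStack analysis), the regression-based procedure can be run end to end with a tabular model whose mean-squared-error minimizer is the per-action sample mean; here the truth $V[E(Y \mid A)] = 1$ is known, so the estimate can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
nu = 20_000  # number of observed games

# Toy game: the player's only action is A in {0, 1}, chosen uniformly,
# and the outcome is Y = 2A - 1 plus noise, so E(Y | A) = 2A - 1 and
# V[E(Y | A)] = 1 exactly.
A = rng.integers(0, 2, size=nu)
Y = (2 * A - 1) + rng.normal(0.0, 1.0, size=nu)

# Fit a model for E(Y | A). With one binary action, the mean-squared-error
# minimizer over tabular models is the per-action sample mean.
fitted = {a: Y[A == a].mean() for a in (0, 1)}
cond_mean = np.array([fitted[a] for a in A])

# The empirical variance of the fitted conditional means estimates
# V[E(Y | A)], which is 1 in this toy game.
estimate = cond_mean.var()
print(round(float(estimate), 3))
```

The printed estimate should be close to 1; the remaining component $E[V(Y \mid A)]$ is then obtained by subtracting from the empirical variance of $Y$.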
For details on how the data were generated, see the supplemental materials of the DeepStack paper [17]. Our goal is to understand the role chance has in influencing the per-hand profits of a human playing against DeepStack, so we estimate the variance component for chance for games played by each human player, indexed by $j \in \{1, \dots, 33\}$. We also include an algorithm used for poker agent evaluation called Local Best Response (index $j = 0$), as a form of transfer learning to improve estimates of expected outcomes for the human players. Assume that
Figure 2: The neural network architecture used for analysis of DeepStack hands. The input for each card (shown in blue) is a concatenation of the rank and suit of the card. The rank and suit are each assigned a vector embedding, with the same weights shared for all card inputs.

player $j$ plays according to a policy $\pi_j$, and write $E_{\pi_j}(Y \mid A^c)$ to denote the expected per-hand profit for player $j$ against DeepStack given all chance events $A^c$. Then we would like to know $V[E_{\pi_j}(Y \mid A^c)]$ for each $j$.

We use a neural network to estimate $E_{\pi_j}(Y \mid A^c)$ given a player and the realization of all chance events:

• The player's pocket cards (2 cards)
• DeepStack's pocket cards (2 cards)
• The flop (3 cards)
• The turn (1 card)
• The river (1 card)

The neural network shares a representation of cards across all inputs: each card rank (e.g., Ace) and suit (e.g., hearts) is associated with a learned vector embedding; a card is represented by the concatenation of these embeddings. To capture the unordered nature of players' pocket cards and the flop, the card representations for each of those groups are summed. The architecture is depicted in Figure 2.

Our model was trained by stochastic gradient descent with the Adam optimizer [13], with early stopping based on cross-validation loss using a 90%/10% train-test split. If a hand ended before all chance events were observed (for example, if a player folded before the river), the cards associated with that chance event were randomly sampled from the remaining cards in the deck at that point in the game. These cards were resampled in each epoch of training in order to decrease variance.

We present our results in Table 1. For each player, the empirical variance of the regression estimator was computed over both the training and test data and is recorded in the column "Chance var." Typical values for the percent of total variance "explained" by chance events fall between 0% and 2%.
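The shared-embedding card encoder described above can be sketched in NumPy as follows; the embedding size, dense layers, and example inputs are hypothetical, since the exact dimensions are not stated here.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 8  # hypothetical embedding size for ranks and for suits

# Shared embedding tables: 13 ranks, 4 suits. The same weights are used
# for every card input, as described in the text.
rank_emb = rng.normal(size=(13, EMB))
suit_emb = rng.normal(size=(4, EMB))

def card_vec(rank, suit):
    # A card is the concatenation of its rank and suit embeddings.
    return np.concatenate([rank_emb[rank], suit_emb[suit]])

def group_vec(cards):
    # Unordered groups (each pocket, the flop) are summed, making the
    # representation invariant to card order within the group.
    return np.sum([card_vec(r, s) for r, s in cards], axis=0)

def encode_hand(player_pocket, ds_pocket, flop, turn, river):
    # Concatenate the five chance-event groups into one feature vector,
    # which would then feed the dense layers of Figure 2.
    return np.concatenate([
        group_vec(player_pocket), group_vec(ds_pocket),
        group_vec(flop), card_vec(*turn), card_vec(*river),
    ])

x = encode_hand([(12, 0), (11, 0)], [(0, 1), (1, 2)],
                [(5, 3), (6, 3), (7, 3)], (8, 0), (9, 1))
print(x.shape)  # 5 groups x (2 * EMB) = 80 features with EMB = 8
```

Swapping the two cards of a pocket, or permuting the flop, leaves the encoding unchanged by construction.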
We conclude that the influence of chance events alone on per-hand outcomes is

Player name             Num. hands   Mean profit     Variance   Chance var.   Chance var. %
Local best response         106221         -66.2    4711362.4       68577.4             1.5
Ivan Shabalin                 3122         -33.5    3419341.6       26105.1             0.8
Pol Dmit                      3026         -93.3    4992102.7       78447.3             1.6
Muskan Sethi                  3010        -214.1    8069582.3      152633.7             1.9
Dmitry Lesnoy                 3007          11.5    4422563.8       27593.1             0.6
Stanislav Voloshin            3006           6.4    3272512.8       19559.7             0.6
Lucas Schaumann               3004         -15.7    2585189.5       38743.3             1.5
Phil Laak                     3003         -77.3    3576155.5       37342.1             1.0
Antonio Parlavecchio          3003        -108.8    7218296.0      116614.0             1.6
Kaishi Sun                    3002          -0.5    4138381.9       36490.0             0.9
Martin Sturc                  3001          51.3    2579614.6       26995.9             1.0
Prakshat Shrimankar           3001         -17.4    3468418.3       43344.5             1.2
Tsuneaki Takeda               1901          33.3    7458278.7       13850.9             0.2
Youwei Qin                    1759        -195.3   14797348.3      118693.4             0.8
Fintan Gavin                  1555           2.6   10967917.3       35274.5             0.3
Giedrius Talacka              1514         -45.9   11464541.4       69281.2             0.6
Juergen Bachmann              1088        -176.9    7804660.5      190849.5             2.4
Sergey Indenok                 852         -25.3   13895176.8       44966.3             0.3
Sebastian Schwab               516        -180.0    6250606.2       25266.9             0.4
Dara Okearney                  456         -22.3    3365433.0       30225.4             0.9
Roman Shaposhnikov             330          89.8    3951695.5       22054.5             0.6
Shai Zurr                      330        -115.4    4148165.0       43651.8             1.1
Luca Moschitta                 328        -143.8    4833549.3       76582.4             1.6
Stas Tishekvich                295          34.6    3904856.9       68538.0             1.8
Eyal Eshkar                    191         -71.5    8773209.3       83846.6             1.0
Jefri Islam                    176        -382.2   10558538.8       67446.8             0.6
Fan Sun                        122         129.1    9265866.5       54513.5             0.6
Igor Naumenko                  102         -85.1     611611.4       32192.0             5.3
Silvio Pizzarello               90        -513.4   10435348.4       96080.5             0.9
Gaia Freire                     76         -13.8      92173.2       44686.1            48.5
Alexander Bös                   74          -0.1    1286240.1        7153.1             0.6
Victor Santos                   58         175.9     956344.0       24456.7             2.6
Mike Phan                       32        1122.2   25579723.6       25381.5             0.1
Juan-Manuel Pastor               7        -728.6    1135714.3       26707.9             2.4
Table 1: Analysis of expected per-hand player profits for human professionals against the DeepStack poker AI.

quite limited. Rather, the large amount of variation in per-hand profits is mostly explained by player randomization and the interaction between those actions and chance. We elaborate in the Discussion section.
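As an arithmetic check, the "Chance var. %" column is the ratio of the chance variance component to the total per-hand variance; recomputing it for a few Table 1 rows reproduces the reported percentages.

```python
# Rows from Table 1: (player, total variance, chance variance component).
rows = [
    ("Local best response", 4711362.4, 68577.4),
    ("Ivan Shabalin",       3419341.6, 26105.1),
    ("Gaia Freire",           92173.2, 44686.1),
]

for name, total_var, chance_var in rows:
    pct = 100.0 * chance_var / total_var
    print(f"{name}: {pct:.1f}%")  # 1.5%, 0.8%, 48.5% as in Table 1
```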
As another example of using variance decompositions to analyze games, we present a concept for measuring skill, chance, and non-transitivity that is inspired by prior work on decompositions of games [10] and recent developments regarding learning in the context of complex games with nontransitive elements [19, 20]. For simplicity, assume we are given a symmetric two-player zero-sum game and a population of players represented by a finite set of policies $\Pi$, each with a skill rating $\rho_\pi$ for $\pi \in \Pi$. One notion of the skillfulness of the game is the variance in outcomes explained by players' skill ratings alone, assuming two policies $(\pi_1, \pi_2)$ are sampled uniformly from $\Pi$:

$$V(Y) = V[E(Y \mid \rho_{\pi_1}, \rho_{\pi_2})] + E[V(Y \mid \rho_{\pi_1}, \rho_{\pi_2})]. \quad (5.1)$$

a1 \ a2      Rock   Paper   Scissors
Rock            0      -1          1
Paper           1       0         -1
Scissors       -1       1          0

Table 2: The payoff function RPS($a_1$, $a_2$).

Applying the law of total variance to the conditional variance $V(Y \mid \rho_{\pi_1}, \rho_{\pi_2})$, we condition on chance actions $A^c$ as in (3.1) to obtain $V(Y \mid \rho_{\pi_1}, \rho_{\pi_2}) = V[E(Y \mid A^c, \rho_{\pi_1}, \rho_{\pi_2}) \mid \rho_{\pi_1}, \rho_{\pi_2}] + E[V(Y \mid A^c, \rho_{\pi_1}, \rho_{\pi_2}) \mid \rho_{\pi_1}, \rho_{\pi_2}]$. Using linearity of expectation and the tower rule, this allows us to extend (5.1) to

$$V(Y) = \underbrace{V[E(Y \mid \rho_{\pi_1}, \rho_{\pi_2})]}_{\text{skill}} + \underbrace{E\{V[E(Y \mid A^c, \rho_{\pi_1}, \rho_{\pi_2}) \mid \rho_{\pi_1}, \rho_{\pi_2}]\}}_{\text{chance}} + \underbrace{E[V(Y \mid A^c, \rho_{\pi_1}, \rho_{\pi_2})]}_{\text{remaining variation}}. \quad (5.2)$$

We apply this formula to analyze a simple game parametrized by constants $n \in \mathbb{N}$, $c \in \mathbb{N} \cup \{0\}$, and $\alpha \in [0, 1]$ that can be seen as an abstract model of a game with a skill component (some strategies are strictly better than others), a nontransitive component (there exist cycles of pure strategies), and chance (some games are decided by events entirely out of the players' hands). Skillful Rock Paper Scissors, or SkillRPS($n$, $c$, $\alpha$), is defined as follows: each player $i \in \{1, 2\}$ simultaneously selects a number $N_i \in \{1, \dots, n\}$ and a move $A_i \in \{\text{Rock}, \text{Paper}, \text{Scissors}\}$. Player 1's score is $S = N_1 - N_2 + c \cdot \mathrm{RPS}(A_1, A_2)$, where RPS is the payoff function for Rock Paper Scissors depicted in Table 2. The outcome of the game for player 1 is

$$Y = (1 - W)[\mathbb{1}(S > 0) - \mathbb{1}(S < 0)] + W(2Z - 1),$$

where $W \sim \mathrm{Bernoulli}(\alpha)$ and $Z \sim \mathrm{Bernoulli}(1/2)$ are chance events such that $W$ determines whether the game is decided by a fair coin flip $Z$. Note that when $n = 1$, $c > 0$, and $\alpha = 0$, the game is classic Rock Paper Scissors; when $\alpha = 1$ it is a coin flip; and when $c = 0$ it is a transitive game. The game can be represented in extensive form as shown in Figure 1, which depicts an instance of SkillRPS with $n = 2$ and $c = 0$.

Figure 3: Three-way variance decompositions for SkillRPS with different game parameters, under the assumption that players select moves independently and uniformly at random, i.e., for $i \in \{1, 2\}$, $N_i \sim \mathrm{Uniform}(\{1, \dots, n\})$ and $A_i \sim \mathrm{Uniform}(\{\text{Rock}, \text{Paper}, \text{Scissors}\})$, all independent. Details on the variance components for SkillRPS are included in Appendix C.

One might hope that the variance component for chance, $V[E(Y \mid A^c)]$, measures how lucky a game is in the context of the players playing the game. We argue that this is not the case, and conclude with thoughts on the applicability of variance component estimation for the analysis of games.

First, the variance component for chance does not measure how lucky a game is because by design it avoids measuring variation introduced by random player actions. Consider the classic version of Rock Paper Scissors (RPS) depicted in Table 2. A cautious player can guarantee an expected payoff of 0 by assigning uniform probability to each action, causing the outcome of the game to be uniformly random over $\{-1, 0, 1\}$. For this reason, it is natural to view RPS as a game of luck; however, RPS as typically modeled does not have a chance player. All variation in RPS comes from randomness in player action selection. So, if we are to call RPS a game of luck, then a notion of luck that only considers chance events is inadequate.

Second, the variance component for chance is conservative in that it only measures the marginal (average) effect of chance actions on game outcomes. It does not capture the interaction between chance events and player actions.
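Because SkillRPS has a finite outcome space, the three components of (5.2) can be computed exactly by enumeration rather than simulation; the sketch below (illustrative, not the authors' code) assumes the same uniform, independent policies as Figure 3.

```python
from itertools import product

# Payoff function RPS(a1, a2) from Table 2.
RPS = {("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
       ("P", "R"): 1, ("P", "P"): 0, ("P", "S"): -1,
       ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0}

def decompose(n, c, alpha):
    """Exact (skill, chance, remaining) components of (5.2) for SkillRPS."""
    atoms = []  # ((n1, n2), (n1, n2, w, z), probability, outcome y)
    for n1, n2, a1, a2, w, z in product(range(1, n + 1), range(1, n + 1),
                                        "RPS", "RPS", (0, 1), (0, 1)):
        p = (1 / n) ** 2 * (1 / 9) * (alpha if w else 1 - alpha) * 0.5
        s = n1 - n2 + c * RPS[(a1, a2)]
        y = (1 - w) * ((s > 0) - (s < 0)) + w * (2 * z - 1)
        atoms.append(((n1, n2), (n1, n2, w, z), p, y))

    ey = sum(p * y for _, _, p, y in atoms)
    total = sum(p * (y - ey) ** 2 for _, _, p, y in atoms)

    def var_cond_mean(idx):
        # V[E(Y | key)] where key is component idx of each atom.
        groups = {}
        for atom in atoms:
            ps, py = groups.get(atom[idx], (0.0, 0.0))
            groups[atom[idx]] = (ps + atom[2], py + atom[2] * atom[3])
        return sum(ps * (py / ps - ey) ** 2 for ps, py in groups.values())

    skill = var_cond_mean(0)              # V[E(Y | N1, N2)]
    remaining = total - var_cond_mean(1)  # E[V(Y | W, Z, N1, N2)]
    chance = total - skill - remaining    # middle term of (5.2)
    return skill, chance, remaining

skill, chance, remaining = decompose(2, 0, 0.5)
print(skill, chance, remaining)  # with c = 0 the remaining component is 0
```

For $n = 2$, $c = 0$, $\alpha = 1/2$ this agrees with the closed forms in Appendix C: skill $= (1-\alpha)^2\psi = 1/8$, chance $= \alpha + \alpha(1-\alpha)\psi = 5/8$, remaining $= 0$.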
For example, consider a variant of RPS in which one of the players is replaced with a chance player. If the non-chance player employs a uniform random policy, then the expected outcome is 0 regardless of which action is selected by chance. Thus $E(Y \mid A^c = a) = 0$ for each $a \in \{\text{Rock}, \text{Paper}, \text{Scissors}\}$. This means that for any chance policy, the variance component for chance is 0, yet from the player's perspective, against a uniform chance policy, it is as though the game outcome is entirely determined by chance!

What the variance component for chance actually measures is the per-game amount by which chance biases the outcome in favor of a player. In both the examples given above, luck plays a significant role in the game outcomes, but the realization of chance events alone does not tend to significantly tilt the game in the favor of either player, so our measure evaluates to 0. Returning to the analysis of DeepStack poker hands, we can see that despite the large amount of variation in per-hand profits (of which any one realization could be called "lucky"), the game (as played at a high level) is in some sense fair: on a hand-by-hand basis, the average amount that the random deck order advantages or disadvantages a particular player is small.

Video game designers may find the variance component for chance helpful in assessing the per-play advantage gleaned by a player due to chance events. We speculate that for a rewarding game experience, the variance component should be kept low, or else players will feel a sense of limited agency. Returning to the question of the legality of poker, our measure could represent a sufficient (but not necessary) criterion for determining that a game is "predominantly due to chance": if the ratio of the variance component for the chance player to the total variation is greater than 50%, then clearly the game outcomes could be said to be predominantly due to chance.
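The chance-player RPS variant above can be verified by direct enumeration; the encoding is illustrative.

```python
from itertools import product

# RPS payoff for the (non-chance) player, from Table 2, where the column
# player's role is taken over by a chance player.
payoff = {("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
          ("P", "R"): 1, ("P", "P"): 0, ("P", "S"): -1,
          ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0}

player_policy = {m: 1 / 3 for m in "RPS"}  # uniform random player
chance_policy = {m: 1 / 3 for m in "RPS"}  # any chance policy works here

# E(Y | A_c = a): average outcome against each fixed chance action.
cond_means = {a: sum(player_policy[m] * payoff[(m, a)] for m in "RPS")
              for a in "RPS"}
print(cond_means)  # 0 for every chance action

# Hence V[E(Y | A_c)] = 0, even though V(Y) = 2/3: the outcome is
# -1, 0, or 1 with probability 1/3 each (E(Y) = 0, so V(Y) = E(Y^2)).
ey = sum(chance_policy[a] * cond_means[a] for a in "RPS")
v_chance = sum(chance_policy[a] * (cond_means[a] - ey) ** 2 for a in "RPS")
vy = sum(player_policy[m] * chance_policy[a] * payoff[(m, a)] ** 2
         for m, a in product("RPS", "RPS"))
print(v_chance, round(vy, 4))
```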
The three-way variance decomposition in (5.2) offers a way to characterize meaningful properties of games that arise in the context of multiagent reinforcement learning, and presents new research challenges such as (i) accounting for estimation error in the skill rating (however it is defined), and (ii) accounting for the actual distribution from which policies are sampled to play each other, which is often not uniform but rather skill-based, such that players with nearby skill ratings are likely to be placed together.

References

[1] Ronald A. Fisher. "XV.—The correlation between relatives on the supposition of Mendelian inheritance". In: Earth and Environmental Science Transactions of the Royal Society of Edinburgh.
[2] In: International Journal of Game Theory.
[3] In: Gaming Law Review.
[4] In: Chance.
[5] In: Gaming Law Review.
[6] In: Gaming Law Review and Economics.
[7] In: Review of Law & Economics.
[8] Yoav Shoham and Kevin Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008.
[9] Marc Lanctot et al. "Monte Carlo sampling for regret minimization in extensive games". In: Advances in Neural Information Processing Systems. 2009, pp. 1078–1086.
[10] Ozan Candogan et al. "Flows and decompositions of games: Harmonic and potential games". In: Mathematics of Operations Research.
[11] Characteristics of Games. MIT Press, 2012.
[12] Randall D. Heeb. Report of Randall D. Heeb, PhD (United States of America against Lawrence Dicristina). Case 1:11-cr-00414, Document 77-1. July 2012.
[13] Diederik P. Kingma and Jimmy Ba. "Adam: A method for stochastic optimization". In: arXiv preprint arXiv:1412.6980 (2014).
[14] Steven D. Levitt and Thomas J. Miles. "The role of skill versus luck in poker: Evidence from the World Series of Poker". In: Journal of Sports Economics.
[15] In: International Conference on Machine Learning. 2015, pp. 805–813.
[16] Rogier J. D. Potter van Loon, Martijn J. van den Assem, and Dennie van Dolder. "Beyond chance? The persistence of performance in online poker". In: PLoS ONE.
[17] Matej Moravcik et al. "DeepStack: Expert-level artificial intelligence in heads-up no-limit poker". In: Science.
[18] In: SIAM Review.
[19] In: arXiv preprint arXiv:1901.08106 (2019).
[20] Shayegan Omidshafiei et al. "Navigating the Landscape of Games". In: arXiv preprint arXiv:2005.01642 (2020).
A Variance component formula derivation
Recall that $I_z = \prod_{i \in \mathcal{N} \cup \{c\}} I^i_z = \prod_{i \in \mathcal{N} \cup \{c\}} \prod_{j=1}^{m_i(z)} I^i_{z,j}$ is the indicator that all actions along terminal history $z$ are selected, $Y = \sum_{z \in \mathcal{Z}} r(z)\, I_z$, and $u^i_{z,j}$ is the $j$th information state observed by player $i$ in terminal history $z$. Write $I^i_{z,k:} = \prod_{j=k}^{m_i(z)} I^i_{z,j}$, the indicator that player $i$ selects all actions in $z$ at and after $u^i_{z,k}$. Let $d(u)$ be the depth of $u$ in its trajectory; for example, if $u$ is the first observation of a player in their trajectory, $d(u) = 1$. Define $W_u = \sum_{z \in \mathcal{Z}(u)} r(z)\, \eta_{-i}(z)\, I^i_{z,d(u):}$ and $\mathcal{U}^i_d = \{u \in \mathcal{U}_i : d(u) = d\}$. Then

$$V[E(Y \mid A^i)] = V\Big[ \sum_{z \in \mathcal{Z}} r(z)\, \eta_{-i}(z)\, I^i_z \Big] = V\Big[ \sum_{u \in \mathcal{U}^i_1} \sum_{z \in \mathcal{Z}(u)} r(z)\, \eta_{-i}(z)\, I^i_z \Big] = V\Big( \sum_{u \in \mathcal{U}^i_1} W_u \Big). \quad (A.1)$$

Note that histories $z \in \mathcal{Z}$ that contain no information states for player $i$ have $I^i_z \equiv 1$, so they are constant inside the conditional expectation, which is why the second and third expressions are equal.

By the perfect recall assumption, each information state $u$ can be uniquely identified with the sequence of information states and actions required to reach $u$. Furthermore, the behavioral strategy assumption gives that $A(u)$ is independent of $A(u')$ if $u \neq u' \in \mathcal{U}_i$. Therefore, if $u^i_{z,j} \neq u^i_{z',j}$ for some $j$, then $I^i_{z,h}$ is independent of $I^i_{z',h'}$ for all $h, h' \in \{j, \dots, \min[m_i(z), m_i(z')]\}$. We conclude that $W_u$ is independent of $W_{u'}$ if $u \neq u'$. This allows us to split up (A.1):

$$V[E(Y \mid A^i)] = V\Big( \sum_{u \in \mathcal{U}^i_1} W_u \Big) = \sum_{u \in \mathcal{U}^i_1} V(W_u) = \sum_{u \in \mathcal{U}^i_1} \Big( V\{E[W_u \mid A(u)]\} + E\{V[W_u \mid A(u)]\} \Big). \quad (A.2)$$

The last equality holds by the law of total variance. To evaluate the components of (A.2), write $W_{ua} = \sum_{z \in \mathcal{Z}(u,a)} r(z)\, \eta_{-i}(z)\, I^i_{z,[d(u)+1]:}$ for each $a \in \mathcal{A}(u)$, so we have that

$$W_u = \sum_{z \in \mathcal{Z}(u)} r(z)\, \eta_{-i}(z)\, I^i_{z,d(u):} = \sum_{a \in \mathcal{A}(u)} \sum_{z \in \mathcal{Z}(u,a)} r(z)\, \eta_{-i}(z)\, I^i_{z,d(u)}\, I^i_{z,[d(u)+1]:} = \sum_{a \in \mathcal{A}(u)} W_{ua}\, \mathbb{1}(A(u) = a).$$

Now evaluate each variance component. We begin with:

$$V\{E[W_u \mid A(u)]\} = V\Big[ E\Big( \sum_{a \in \mathcal{A}(u)} W_{ua}\, \mathbb{1}(A(u) = a) \,\Big|\, A(u) \Big) \Big] = V\Big[ \sum_{a \in \mathcal{A}(u)} E(W_{ua})\, \mathbb{1}(A(u) = a) \Big] = \sum_{a \in \mathcal{A}(u)} [E(W_{ua})]^2\, \pi_i(a \mid u) - \Big[ \sum_{a \in \mathcal{A}(u)} E(W_{ua})\, \pi_i(a \mid u) \Big]^2.$$

Write $\bar r(u,a) = E\{r(Z)\, \mathbb{1}[Z \in \mathcal{Z}(u,a)]\}$ and $\bar r(u) = E\{r(Z)\, \mathbb{1}[Z \in \mathcal{Z}(u)]\}$. Then

$$E(W_{ua}) = \sum_{z \in \mathcal{Z}(u,a)} r(z)\, \eta_{-i}(z) \prod_{j=d(u)+1}^{m_i(z)} \pi_i(a^i_{z,j} \mid u^i_{z,j}) = \sum_{z \in \mathcal{Z}(u,a)} r(z)\, \eta_{-i}(z) \Big[ \prod_{j=1}^{m_i(z)} \pi_i(a^i_{z,j} \mid u^i_{z,j}) \Big] \Big[ \prod_{j=1}^{d(u)} \pi_i(a^i_{z,j} \mid u^i_{z,j}) \Big]^{-1} = \sum_{z \in \mathcal{Z}(u,a)} r(z)\, \eta(z)\, [\eta_i(u)\, \pi_i(a \mid u)]^{-1} = \bar r(u,a)\, [\eta_i(u)\, \pi_i(a \mid u)]^{-1}.$$

Substituting these terms back into the expression for the variance component, we get that

$$V\{E[W_u \mid A(u)]\} = \sum_{a \in \mathcal{A}(u)} [\bar r(u,a)]^2\, [\eta_i(u)]^{-2} / \pi_i(a \mid u) - \Big[ [\eta_i(u)]^{-1} \sum_{a \in \mathcal{A}(u)} \bar r(u,a) \Big]^2 = [\eta_i(u)]^{-2} \Big( \sum_{a \in \mathcal{A}(u)} [\bar r(u,a)]^2 / \pi_i(a \mid u) - [\bar r(u)]^2 \Big).$$

For the second variance component, we find:

$$E\{V[W_u \mid A(u)]\} = \sum_{a \in \mathcal{A}(u)} V[W_u \mid A(u) = a]\, P[A(u) = a] = \sum_{a \in \mathcal{A}(u)} V\Big( \sum_{z \in \mathcal{Z}(u,a)} r(z)\, \eta_{-i}(z)\, I^i_{z,[d(u)+1]:} \Big)\, \pi_i(a \mid u).$$

Now take $Y = \sum_{z \in \mathcal{Z}(u,a)} r(z)\, \eta_{-i}(z)\, I^i_{z,[d(u)+1]:}$ and repeat the steps shown in (A.2) inductively to obtain that:

$$V[E(Y \mid A^i)] = \sum_{u \in \mathcal{U}_i} \Big( \sum_{a \in \mathcal{A}(u)} [\bar r(u,a)]^2 / \pi_i(a \mid u) - [\bar r(u)]^2 \Big) / \eta_i(u).$$

Because $\bar r(u,a) = q(u,a)\, \eta(u)\, \pi_i(a \mid u)$ and $\bar r(u) = v(u)\, \eta(u)$, this yields (3.2).

B Proof of consistency
Let $\mu(u) = \eta(u) / \sum_{u' \in \mathcal{U}_i} \eta(u')$. Then

$$V[E(Y \mid A^i)] = \sum_{u \in \mathcal{U}_i} \Big( \sum_{a \in \mathcal{A}(u)} [q(u,a)]^2\, \pi_i(a \mid u) - [v(u)]^2 \Big)\, \eta_{-i}(u)\, \eta(u) = \Big( \sum_{u \in \mathcal{U}_i} \eta(u) \Big) \sum_{u \in \mathcal{U}_i} \Big( \sum_{a \in \mathcal{A}(u)} [q(u,a)]^2\, \pi_i(a \mid u) - [v(u)]^2 \Big)\, \eta_{-i}(u)\, \mu(u) = \Big( \sum_{u \in \mathcal{U}_i} \eta(u) \Big)\, E_{U \sim \mu}\Big[ \Big( \sum_{a \in \mathcal{A}(U)} [q(U,a)]^2\, \pi_i(a \mid U) - [v(U)]^2 \Big)\, \eta_{-i}(U) \Big].$$

Note that

$$\sum_{u \in \mathcal{U}_i} \eta(u) = \sum_{u \in \mathcal{U}_i} \sum_{z \in \mathcal{Z}(u)} \eta(z) = \sum_{u \in \mathcal{U}_i} \sum_{z \in \mathcal{Z}} \eta(z)\, \mathbb{1}(u \in z) = \sum_{z \in \mathcal{Z}} \eta(z) \sum_{u \in \mathcal{U}_i} \mathbb{1}(u \in z) = E[d_i(Z)],$$

where $d_i(Z)$ is the length of the trajectory for player $i$ in terminal history $Z$. So, by the law of large numbers, $\nu^{-1} \sum_{k=1}^{\nu} d_i(Z_k) \overset{a.s.}{\to} \sum_{u \in \mathcal{U}_i} \eta(u)$ as $\nu \to \infty$. Consider the Markov chain $\{U_t\}_{t \in \mathbb{N}}$ defined by the information states for player $i$ observed upon repeated independent playthroughs of the game, and let $\varphi(u) = \{\sum_{a \in \mathcal{A}(u)} [q(u,a)]^2\, \pi_i(a \mid u) - [v(u)]^2\}\, \eta_{-i}(u)$. Then $T^{-1} \sum_{t=1}^{T} \varphi(U_t) \overset{a.s.}{\to} E_{U \sim \mu}[\varphi(U)]$ as $T \to \infty$ by a law of large numbers for Markov chains, since $\{U_t\}$ is irreducible and positive recurrent.

Converting both these results to the notation of the original statement of the estimator, we have $\nu^{-1} \sum_{k=1}^{\nu} l_k \overset{a.s.}{\to} \sum_{u \in \mathcal{U}_i} \eta(u)$ and $\big( \sum_{k=1}^{\nu} l_k \big)^{-1} \sum_{k=1}^{\nu} \sum_{l=1}^{l_k} \varphi(U_{k,l}) \overset{a.s.}{\to} E_{U \sim \mu}[\varphi(U)]$ as $\nu \to \infty$. Therefore their product converges to the estimand, as desired.

C SkillRPS decomposition details
Recall that in SkillRPS, the outcome is $Y = (1 - W)[\mathbb{1}(S > 0) - \mathbb{1}(S < 0)] + W(2Z - 1)$, where $S = N_1 - N_2 + c \cdot \mathrm{RPS}(A_1, A_2)$. In this case, a player's selection of $N_i$ is considered to indicate their skill level, and $A^c = (W, Z)$ is the collection of all chance actions. Adapting the three-way decomposition equation (5.2) to SkillRPS yields

$$V(Y) = \underbrace{V[E(Y \mid N_1, N_2)]}_{\text{skill}} + \underbrace{E\{V[E(Y \mid W, Z, N_1, N_2) \mid N_1, N_2]\}}_{\text{chance}} + \underbrace{E[V(Y \mid W, Z, N_1, N_2)]}_{\text{remaining variation}}.$$

Under the assumption that $N_1, N_2 \sim \mathrm{Uniform}(\{1, \dots, n\})$ and are independent of $A_1, A_2 \sim \mathrm{Uniform}(\{\text{Rock}, \text{Paper}, \text{Scissors}\})$, we can derive closed-form expressions for each variance component.

Using routine probability manipulations, one can derive the following term for the variance in $Y$ explained by the "skill" of the players in the case that the coin flip didn't happen ($W = 0$), for all $n \in \mathbb{N}$ and $c \in \mathbb{N} \cup \{0\}$. Begin by finding $E(Y \mid N_1 = n_1, N_2 = n_2, W = 0)$ for arbitrary $n_1, n_2 \in \{1, \dots, n\}$, which is easy since the only remaining source of variation is $\mathrm{RPS}(A_1, A_2) \sim \mathrm{Uniform}(\{-1, 0, 1\})$. Next, treat this term as a discrete random variable depending on $N_1$ and $N_2$ and compute its variance. This yields:

$$V[E(Y \mid N_1, N_2, W = 0)] = \begin{cases} 1 - n^{-1} & \text{if } c = 0 \\ 1 - \dfrac{1}{3n} - \dfrac{16c}{9n} + \dfrac{8c^2 + 2c}{9n^2} & \text{if } 0 < c < n \\ (1 - n^{-1})/9 & \text{if } c \geq n. \end{cases}$$

Call this term $\psi(n, c)$. From here one can find that:

$$V[E(Y \mid N_1, N_2)] = (1 - \alpha)^2\, \psi(n, c), \qquad E\{V[E(Y \mid W, Z, N_1, N_2) \mid N_1, N_2]\} = \alpha + \alpha(1 - \alpha)\, \psi(n, c).$$

Finally, we also get that

$$E[V(Y \mid W, Z, N_1, N_2)] = \begin{cases} 0 & \text{if } c = 0 \\ (1 - \alpha)\left[1 - \dfrac{1}{3n} - \dfrac{2(n - c)}{3n^2} - \psi(n, c)\right] & \text{if } 0 < c < n \\ (1 - \alpha)\left[\dfrac{8}{9} - \dfrac{2}{9n}\right] & \text{if } c \geq n. \end{cases}$$
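The piecewise formula for $\psi(n, c)$ above can be checked against brute-force enumeration in exact rational arithmetic; this sketch (not from the paper) validates it for small $n$ and $c$, covering all three branches.

```python
from itertools import product
from fractions import Fraction

# Payoff function RPS(a1, a2) from Table 2.
RPS = {("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
       ("P", "R"): 1, ("P", "P"): 0, ("P", "S"): -1,
       ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0}

def psi_brute(n, c):
    # V[E(Y | N1, N2, W = 0)] by direct enumeration, in exact arithmetic:
    # for each (n1, n2), average sign(S) over the nine uniform (a1, a2).
    means = []
    for n1, n2 in product(range(1, n + 1), repeat=2):
        signs = [(s > 0) - (s < 0)
                 for a1, a2 in product("RPS", repeat=2)
                 for s in [n1 - n2 + c * RPS[(a1, a2)]]]
        means.append(Fraction(sum(signs), 9))
    e = sum(means, Fraction(0)) / len(means)
    return sum((m - e) ** 2 for m in means) / len(means)

def psi_closed(n, c):
    # Closed form for psi(n, c) as reconstructed in the text.
    if c == 0:
        return 1 - Fraction(1, n)
    if c >= n:
        return (1 - Fraction(1, n)) / 9
    return (1 - Fraction(1, 3 * n) - Fraction(16 * c, 9 * n)
            + Fraction(8 * c * c + 2 * c, 9 * n * n))

checks = [(n, c) for n in range(1, 7) for c in range(0, 8)]
assert all(psi_brute(n, c) == psi_closed(n, c) for n, c in checks)
print("psi(n, c) matches enumeration for", len(checks), "cases")
```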