Estimating α-Rank by Maximizing Information Gain
Tabish Rashid*, Cheng Zhang, Kamil Ciosek
University of Oxford; Microsoft Research, Cambridge
[email protected], {cheng.zhang, kamil.ciosek}@microsoft.com

Abstract
Game theory has been increasingly applied in settings where the game is not known outright, but has to be estimated by sampling. For example, meta-games that arise in multi-agent evaluation can only be accessed by running a succession of expensive experiments that may involve simultaneous deployment of several agents. In this paper, we focus on α-rank, a popular game-theoretic solution concept designed to perform well in such scenarios. We aim to estimate the α-rank of the game using as few samples as possible. Our algorithm maximizes information gain between an epistemic belief over the α-ranks and the observed payoff. This approach has two main benefits. First, it allows us to focus our sampling on the entries that matter the most for identifying the α-rank. Second, the Bayesian formulation provides a facility to build in modeling assumptions by using a prior over game payoffs. We show the benefits of using information gain as compared to the confidence interval criterion of ResponseGraphUCB (Rowland et al. 2019), and provide theoretical results justifying our method.

Introduction

Traditionally, game theory is applied in situations where the game is fully known. More recently, empirical game theory addresses the setting where this is not the case; instead, the game is initially unknown and has to be interacted with by sampling (Wellman 2006). One area in which this is becoming increasingly common is the ranking of trained agents relative to one another. Specifically, in the field of Reinforcement Learning, game-theoretic rankings are used not just as a metric for measuring algorithmic progress (Balduzzi et al. 2018), but also as an integral component of many population-based training methods (Muller et al. 2020; Lanctot et al. 2017; Vinyals et al. 2019). In particular, for ranking, two popular solution concepts have recently emerged: Nash averaging (Balduzzi et al. 2018; Nash 1951) and α-rank (Omidshafiei et al. 2019).

In this paper, we aim to estimate the α-rank of a game using as few samples as possible. We use the α-rank solution concept for two reasons. First, it admits a unique solution whose computation easily scales to K-player games. Second, unlike older schemes such as Elo (Elo 1978), α-rank is designed with intransitive interactions in mind. Because measuring payoffs can be very expensive, it is important to do it using as few samples as possible. For example, playing a match of chess (Silver et al. 2017) can take roughly 40 minutes (assuming a typical game length of 40 moves and up to 1 minute per move, as used during evaluation), and playing a full game of Dota 2 can take up to 2 hours (Berner et al. 2019). Our objective is thus to accurately estimate the α-rank using a small number of payoff queries.

Rowland et al. (2019) proposed ResponseGraphUCB (RG-UCB) for this purpose, inspired by the pure-exploration bandit literature. RG-UCB maintains confidence intervals over payoffs. When they don't overlap, it draws a conclusion about their ordering, until all comparisons relevant for the computation of α-rank have been made. While this is provably sufficient to determine the true α-rank with high probability in the infinite-α regime, their approach has two important limitations. First, since the frequentist criterion is indirect, relying on payoff ordering rather than the α-rank, the obtained payoffs aren't always used optimally. Second, it is nontrivial to include useful domain knowledge about the entries or structure of the payoff matrix.

To remedy these problems, we propose a Bayesian approach.
Specifically, we utilize a Gaussian Process to maintain an epistemic belief over the entries of the payoff matrix, providing a powerful framework in which to supply domain knowledge. This payoff distribution induces an epistemic belief over α-ranks. We determine which payoff to sample by maximizing information gain between the α-rank belief and the obtained payoff. This allows us to focus our sampling on the entries that are expected to have the largest effect on our belief over possible α-ranks. Contributions:
Theoretically, we justify the use of information gain by showing a regret bound for a version of our criterion in the infinite-α regime. Empirically, our contribution is threefold. First, we compare to RG-UCB on stylized games, showing that maximizing information gain provides competitive performance by focusing on sampling the more relevant payoffs. Second, we evaluate another objective based on minimizing the Wasserstein divergence, which offers competitive performance while being computationally much cheaper. Finally, we demonstrate the benefit of building in prior assumptions.

Background
Games and α-Rank. A game with K players, each of whom can play S_k strategies, is characterized by its expected payoffs M ∈ ℝ^{S_1 × … × S_K} (Fudenberg and Tirole 1991). Letting S = S_1 × … × S_K be the space of pure strategy profiles, the game also specifies a distribution over the payoffs associated with each player when s ∈ S is played. The α-rank of a game is computed by first defining an irreducible Markov chain whose nodes are pure strategy profiles in S. We denote the stochastic matrix defining this chain as C. The transition probabilities of the chain C are calculated as follows. Let σ, τ ∈ S be such that τ only differs from σ in a single player's strategy, and let η = (Σ_{k=1}^{K} (|S_k| − 1))^{−1} be the reciprocal of the total number of those distinct τ. Let M_k(σ) denote the expected payoff for player k when σ is played. Then, the probability of transitioning from σ to τ, which varies only in player k's strategy, is

$$C_{\sigma,\tau} = \begin{cases} \eta \, \dfrac{1 - \exp\left(-\alpha\left(M_k(\tau) - M_k(\sigma)\right)\right)}{1 - \exp\left(-\alpha m \left(M_k(\tau) - M_k(\sigma)\right)\right)} & \text{if } M_k(\tau) \neq M_k(\sigma), \\[2mm] \dfrac{\eta}{m} & \text{otherwise}. \end{cases}$$

$C_{\sigma,\upsilon} = 0$ for all υ that differ from σ in more than a single player's strategy, and $C_{\sigma,\sigma} = 1 - \sum_{\tau \neq \sigma} C_{\sigma,\tau}$ to ensure a valid transition distribution; α ≥ 0 and m ∈ ℕ_{>0} are parameters of the algorithm. We define the α-rank r ∈ ℝ^{|S|} as the unique stationary distribution of the chain C (Omidshafiei et al. 2019; Rowland et al. 2019) as α → ∞. In practice, a large finite value of α is used, or a perturbed version of the transition matrix C is used with an infinite α to ensure the resulting Markov chain C is irreducible.

Single Population α-Rank. In this paper we focus on the infinite-α regime and restrict our attention to the 2-player single-population case of α-rank, which differs slightly from the above. Importantly, our method can be applied to multiple populations as described above in a straightforward way, but we focus on the single-population case for simplicity. Let S = S_1 and let M(σ, τ) denote the payoff when the first player plays σ and the second player plays τ. Note that S_1 = S_2, since the single-population case considers a player playing a game against an identical player. In this particular setting, the α-rank r ∈ ℝ^{|S|} and the perturbed transition matrix C ∈ ℝ^{S × S} are calculated as follows, for σ ≠ τ:

$$C_{\sigma,\tau} = \begin{cases} (|S|-1)^{-1}\,(1-\epsilon) & \text{if } M(\tau,\sigma) > M(\sigma,\tau), \\ (|S|-1)^{-1}\,\epsilon & \text{if } M(\tau,\sigma) < M(\sigma,\tau), \\ 0.5\,(|S|-1)^{-1} & \text{if } M(\tau,\sigma) = M(\sigma,\tau). \end{cases}$$

Again $C_{\sigma,\sigma} = 1 - \sum_{\tau \neq \sigma} C_{\sigma,\tau}$ to ensure a valid transition distribution, and ε is a small perturbation to ensure irreducibility of the resulting chain. We abstract the above computation into the α-rank function f : 𝓜 → ℝ^{|S|}, where 𝓜 is the space of 2-player payoff matrices with S strategies for each player.
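To make this computation concrete, the following is a minimal Python sketch of the single-population, infinite-α procedure just described; the function name and the eigenvector-based stationary-distribution solver are our own choices, not taken from the paper's code release.

```python
import numpy as np

def alpharank_single(M, eps=1e-5):
    """Infinite-alpha, single-population alpha-rank of a payoff matrix M.

    M[s, t] is the expected payoff when the row player uses s and the
    column player uses t. Returns the stationary distribution of the
    perturbed Markov chain C defined in the text.
    """
    S = M.shape[0]
    eta = 1.0 / (S - 1)
    C = np.zeros((S, S))
    for s in range(S):
        for t in range(S):
            if s == t:
                continue
            if M[t, s] > M[s, t]:        # t beats s: likely transition
                C[s, t] = eta * (1.0 - eps)
            elif M[t, s] < M[s, t]:      # t loses to s: unlikely transition
                C[s, t] = eta * eps
            else:                        # equally matched strategies
                C[s, t] = 0.5 * eta
        C[s, s] = 1.0 - C[s].sum()       # valid transition distribution
    # Stationary distribution: left eigenvector of C for eigenvalue 1.
    vals, vecs = np.linalg.eig(C.T)
    r = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1.0))]))
    return r / r.sum()
```

The perturbation `eps` plays the same role as ε above, keeping the chain irreducible.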
Wasserstein Divergence. Let p and q be probability distributions supported on 𝒳, and let c : 𝒳 × 𝒳 → [0, ∞) be a distance. Define Π as the space of all joint probability distributions with marginals p and q. The Wasserstein divergence (Villani 2008) with cost function c is defined as

$$W_c(p, q) := \min_{\pi \in \Pi} \int_{\mathcal{X} \times \mathcal{X}} c(x, y)\, d\pi(x, y).$$

In this paper, we will utilize the Wasserstein distance between our belief distributions over α-rank, and so we set 𝒳 = Δ^{S−1}, the (S−1)-dimensional probability simplex, and use $c(x, y) = \|x - y\|_1$, i.e. the total variation distance. We will drop the suffix and denote this simply as W.

Related Work

There are many methods related to the ranking and evaluation of agents in games. Elo (Elo 1978) and TrueSkill (Herbrich, Minka, and Graepel 2007; Minka, Cleven, and Zaykov 2018) both quantify the performance of an agent using a single number, which means they are unable to model intransitive interactions. Chen and Joachims (2016) extend TrueSkill to better model such interactions, while Balduzzi et al. (2018) do the same for Elo, improving its predictive power by introducing additional parameters. Balduzzi et al. (2018) also re-examine the use of the Nash equilibrium, proposing to disambiguate across possible equilibria by picking the one with maximum entropy. However, it is well known that computing the Nash equilibrium is computationally difficult (Daskalakis, Goldberg, and Papadimitriou 2009) and only computationally tractable for restricted classes of games. In this paper, we focus on α-rank (Omidshafiei et al. 2019) since it has been designed with intransitive interactions in mind, it is computationally tractable for N-player games, and it shows considerable promise as a component of self-play frameworks (Muller et al. 2020).

Empirical Game Theory (Wellman 2006) is concerned with situations in which a game can only be interacted with through sampling. The work most related to ours investigates sampling strategies and concentration inequalities for the Nash equilibrium as opposed to the α-rank. Walsh, Parkes, and Das (2003) introduce Heuristic Payoff Tables (HPTs) in order to choose the samples that provide the most information about the currently chosen Nash equilibrium, where information is quantified as the reduction in estimated error. This differs from our approach both in the use of α-rank as opposed to the Nash equilibrium as our solution concept, and in the criterion used to select the observed payoff. Tuyls et al. (2020) provide concentration bounds for estimated Nash equilibria. Jordan, Vorobeychik, and Wellman (2008) find Nash equilibria from limited data by using information gain on distributions over strategies, a concept different from our information gain on distributions over ranks. We also utilize α-rank as the solution concept, rather than Nash equilibria.

Muller et al. (2020) utilise α-rank as part of a PSRO (Lanctot et al. 2017) framework. They do not use an adaptive sampling strategy for deciding which entries to sample, but are a natural application for our algorithm (and RG-UCB). Yang et al. (2019) introduce an approximate gradient-based algorithm which does not require access to the entire payoff matrix at once in order to compute α-rank. Although their method does not require the entire payoff matrix at every iteration, it is not designed for operating in the same incomplete-information setting that we explore in this paper, since they assume every entry can be cheaply queried with no noise. Srinivas et al.
(2009) prove regret bounds for Bayesian optimization with GPs. We use their concentration result to derive our bounds, as well as for inspiration for our information gain criterion.

ResponseGraphUCB
Closest to our work is ResponseGraphUCB (RG-UCB), introduced by Rowland et al. (2019), which can be viewed as a frequentist analogue of our method that also operates in the infinite-α regime. RG-UCB first specifies an error threshold δ > 0 and then samples payoffs until a stopping criterion determines that the estimated α-rank is correct with probability at least 1 − δ. A key observation that RG-UCB relies on is that in the infinite-α regime only the ordering between relevant payoffs is important. For example, for pure strategy profiles σ and τ (with payoffs M_k(σ) and M_k(τ) respectively) that are used in the computation of the Markov chain transition probabilities, determining whether M_k(σ) > M_k(τ) or M_k(σ) < M_k(τ) is enough to know the transition probability accurately (the magnitude of the difference |M_k(σ) − M_k(τ)| is unimportant). RG-UCB maintains (1 − δ) confidence intervals for all values of M_k(σ), and determines that the ordering between σ and τ is correct when they do not overlap. A strategy profile is chosen to be sampled until all of its orderings are correctly determined. When all orderings are correctly determined, the algorithm terminates.

Since the confidence intervals are constructed using frequentist concentration inequalities, we refer to RG-UCB as a frequentist algorithm. In contrast, our Bayesian perspective provides a principled method for incorporating prior knowledge into our algorithm, whereas it is much more difficult to encode modelling assumptions and prior knowledge with RG-UCB. The second important difference between our work and RG-UCB is that our information gain criterion is a direct objective, which selects the payoffs to sample based on how likely the received sample is to affect the α-rank. On the other hand, RG-UCB works indirectly, reducing uncertainty about the orderings between individual payoffs without considering their impact on the final α-rank, which makes it less efficient. Rowland et al. (2019) also theoretically justify the use of RG-UCB in the infinite-α regime by proving sample complexity results, whereas we provide asymptotic regret bounds for our approach, which are commonly used to justify the sample efficiency of a Bayesian algorithm (Srinivas et al. 2009). Rowland et al. (2019) additionally provide a method for obtaining uncertainty estimates in the infinite-α regime, which is, however, not used as part of an adaptive sampling strategy.
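For concreteness, the sketch below shows the kind of non-overlap test on which RG-UCB's stopping criterion is built; the interval matches the Hoeffding bound quoted for the RG-UCB baseline in Appendix C, while the helper names are our own.

```python
import math

def hoeffding_interval(mean, n, delta, lo=0.0, hi=1.0):
    """(1 - delta) Hoeffding confidence interval after n observations of a
    payoff bounded in [lo, hi] (cf. the RG-UCB baseline in Appendix C)."""
    half_width = math.sqrt(math.log(2.0 / delta) * (hi - lo) ** 2 / (2.0 * n))
    return mean - half_width, mean + half_width

def ordering_resolved(mean_a, n_a, mean_b, n_b, delta):
    """RG-UCB style test: the ordering of two payoff entries is accepted
    once their confidence intervals no longer overlap."""
    lo_a, hi_a = hoeffding_interval(mean_a, n_a, delta)
    lo_b, hi_b = hoeffding_interval(mean_b, n_b, delta)
    return hi_a < lo_b or hi_b < lo_a
```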
Method

On a high level, our method works by maintaining an epistemic belief over α-ranks and selecting payoffs that lead to the maximum reduction in the entropy of that belief. Figure 1 provides a pictorial overview. In the middle of the figure, we maintain an explicit distribution over the entries of the payoff matrix. This payoff distribution induces a belief over α-ranks, shown on the left. When deciding which payoff to sample, we examine hypothetical belief states after sampling, striving to end up with a belief with the lowest entropy. One such hypothetical, or 'hallucinated', belief is shown on the right. We now describe our method formally, first describing the probabilistic model and then the implementation.

Figure 1: Overview: On the left, a belief over α-ranks is induced by a belief over the payoff matrix shown in the middle. A hallucinated belief distribution is shown on the right.

Payoffs: Ground Truth and Belief
We denote the unknown true payoff matrix as M⋆. To quantify our uncertainty about what this true payoff is, we employ a Gaussian Process M, which also allows us to encode prior knowledge about payoff dependencies. Our framework is sufficiently general to allow for other approaches, such as Bayesian Matrix Factorization (Salakhutdinov and Mnih 2008) or probabilistic methods for Collaborative Filtering (Su and Khoshgoftaar 2009), to be used. We choose Gaussian Processes due to their flexibility in encoding prior knowledge and modelling assumptions, and their ubiquity throughout the literature.

The GP models noise in the payoffs as M̃ = M + ε, where ε ∼ N(0, Iσ_A²). When interacting with the game sequentially, the received payoffs are assumed to be generated as m_t = M⋆(a_t) + ε′_t. Here, the ε′_t are i.i.d. random variables with support on the interval [−σ_A, σ_A]. While it may at first seem surprising that we use Gaussian observation noise in the GP model while assuming a truncated observation noise for the actual observations, this does not in fact affect our theoretical guarantees. We provide more details in Section 6. We denote the history of interactions at time t by H_t. Because of randomness in the observations, H_t is a random variable. The sequence of random variables H_1, H_2, … forms a filtration. We use the symbol h_t to denote a particular realization of history, so that h_t = a_1, m_1, …, a_{t−1}, m_{t−1}.

Belief over α-ranks. Our explicit distribution over the entries of the payoff matrix M induces an implicit belief distribution over the α-ranks. For all valid α-ranks r, P(r) = P(M ∈ f^{−1}(r)), where f^{−1} denotes the pre-image of r under f. In other words, the probability assigned to an α-rank r is the probability assigned to its pre-image by our belief over the payoffs. Since r is represented implicitly, we cannot query its mass function directly. Instead, we access r via sampling. This is done by first drawing a payoff m ∼ M and then computing the resulting α-rank f(m).

Picking Payoffs to Query. At time t, we query the payoff that provides us with the largest information gain about the α-rank. Formally,

$$a_t = \arg\max_a I\big(r; (\tilde{M}_t(a), a) \mid H_t = h_t\big) = \arg\max_a H(r \mid H_t = h_t) - \mathbb{E}_{\tilde{m}_t \sim \tilde{M}_t(a)}\Big[ H\big(r \mid H_t = h_t, A_t = a, \tilde{M}_t(a) = \tilde{m}_t\big) \Big] \quad (1)$$

$$= \arg\min_a \mathbb{E}_{\tilde{m}_t \sim \tilde{M}_t(a)}\Big[ H\big(r \mid H_t = h_t, A_t = a, \tilde{M}_t(a) = \tilde{m}_t\big) \Big]. \quad (2)$$

In Equation (1), H(r | H_t = h_t) is the entropy of our current belief distribution over α-ranks, which does not depend on a and can be dropped from the maximization, producing Equation (2). The expectation in (2) has an intuitive interpretation as the expected negative entropy of our hallucinated belief, i.e. the belief obtained by conditioning on a sample m̃_t from the current model. In essence, we are pretending to receive a sample for entry a, and then computing what our resulting belief over α-ranks will be. By picking the entry as in (2), we are picking the entry whose sample will lead to the largest expected reduction in the entropy of our belief over α-ranks.

Algorithm 1: αIG algorithm. The αIG(NSB) and αIG(Bin) variants differ in the entropy estimator (Line 7).
1: for t = 1, 2, …, T do
2:   for a = 1, 2, …, |S| do
3:     for i = 1, 2, …, N_e do
4:       m̃_t ∼ M̃_t(a)   ▷ 'Hallucinate' a payoff.
5:       Obtain the hallucinated posterior payoff: P(M̂_t | H_t = h_t, A_t = a, M̃_t(a) = m̃_t)
6:       D = {r_1, …, r_{N_b}}, where r_i ∼ f(M̂_t) i.i.d.
7:       ĥ_a^i = ESTIMATE-ENTROPY(D)
8:     end for
9:     ĥ_a = (1/N_e) Σ_{i=1}^{N_e} ĥ_a^i
10:   end for
11:   Query payoff a_t = arg min_a ĥ_a   ▷ Implements Eq. (2).
12: end for
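The sketch below illustrates one query-selection step of Algorithm 1 under a simplifying assumption: an independent Gaussian belief per payoff entry stands in for the full GP posterior, so the conditioning in Line 5 reduces to a scalar Gaussian update. All names are ours; `alpharank` and `entropy` can be the functions sketched elsewhere in this paper.

```python
import numpy as np

def select_entry(post_mean, post_var, sigma_a, alpharank, entropy,
                 n_e=10, n_b=500, rng=None):
    """One step of Algorithm 1 with an independent-Gaussian belief per
    flattened payoff entry. alpharank maps a sampled payoff vector
    (reshaped to a matrix if needed) to an alpha-rank; entropy is an
    estimator such as the binning estimator sketched below."""
    rng = rng or np.random.default_rng()
    scores = np.zeros(post_mean.size)
    for a in range(post_mean.size):
        total = 0.0
        for _ in range(n_e):
            # Line 4: 'hallucinate' a noisy observation of entry a.
            m_tilde = rng.normal(post_mean[a],
                                 np.sqrt(post_var[a] + sigma_a ** 2))
            # Line 5: scalar Gaussian conditioning of entry a.
            gain = post_var[a] / (post_var[a] + sigma_a ** 2)
            mean_h, var_h = post_mean.copy(), post_var.copy()
            mean_h[a] += gain * (m_tilde - post_mean[a])
            var_h[a] *= (1.0 - gain)
            # Line 6: sample alpha-ranks from the hallucinated belief.
            ranks = [alpharank(rng.normal(mean_h, np.sqrt(var_h)))
                     for _ in range(n_b)]
            total += entropy(ranks)          # Line 7
        scores[a] = total / n_e              # Line 9
    return int(np.argmin(scores))            # Line 11, Eq. (2)
```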
Implementation
Our algorithm, which we refer to as αIG, is summarized in Algorithm 1. At a high level, αIG selects an action/payoff to query at each timestep (Line 1). In order to select a payoff to query as in Equation (2), we must approximate the expectation for each payoff (Line 2). In Line 4, we use our epistemic model to obtain a 'hallucinated' outcome m̃_t, as if we had received a sample from selecting payoff a at timestep t. In Line 5, we condition our epistemic model on this 'hallucinated' sample m̃_t in order to obtain our 'hallucinated' posterior over payoffs M̂_t. In Line 7, we empirically estimate the entropy of the resulting induced belief distribution over α-ranks. To approximate the expectation in (2), we average the entropy estimates obtained from N_e different possible hallucinated payoffs in Line 9. Finally, in Line 11, we use these estimates to perform query selection as in (2), selecting a payoff to query at timestep t.

Our algorithm depends on an entropy estimator ESTIMATE-ENTROPY, used in Line 7. We present results for two different entropy estimators: simple binning and NSB. The simple binning estimator estimates the entropy using a histogram. For comparison, we also used NSB (Nemenman, Shafee, and Bialek 2002), an entropy estimator designed to produce better estimates in the small-data regime.
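A minimal version of the binning estimator (the rounding-based variant described in Appendix B; the NSB estimator is available in the ndd package cited in the references) could look as follows; the function name is our own.

```python
import numpy as np
from collections import Counter

def binning_entropy(rank_samples, decimals=2):
    """Plug-in entropy estimate over sampled alpha-ranks. Each rank vector
    is discretized by rounding to `decimals` places (the paper's binning
    estimator rounds to the nearest second decimal place)."""
    keys = [tuple(np.round(r, decimals)) for r in rank_samples]
    counts = np.array(list(Counter(keys).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())
```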
Computational Requirements
The main computational bottleneck of our algorithm is the calculation of α-rank in Line 6 of Algorithm 1. In order to perform query selection as in (2), we must compute the α-rank |S| × N_e × N_b times. For our experiments on the 4x4 Gaussian game, this results in 16 × 10 × 500 = 80,000 computations of α-rank (setting N_e = 10, N_b = 500) to select a payoff to query. Relative to ResponseGraphUCB, our method thus requires significantly more computation in order to select a payoff to query. However, in Empirical Game Theory it is commonly assumed that obtaining samples from the game is very computationally expensive (which is true in many potential practical applications (Berner et al. 2019; Silver et al. 2017; Vinyals et al. 2019)). The increased computation required by our method to select a payoff to sample should then have a negligible impact on the overall computation time, while the increased sample efficiency could potentially lead to large speed-ups.

We perform two simple optimizations when deploying the algorithm in practice. To save computational cost, we observe the same payoff N_r times in Line 11 rather than once, similar to rarely-switching bandits (Abbasi-yadkori, Pál, and Szepesvári 2011). Moreover, the number of samples N_b we can use to estimate the entropy is limited due to the computational cost of computing α-rank. In order to obtain better differentiation between the entropy of beliefs arising from sampling different payoffs, we heuristically perform the conditioning in Line 5 N_c times. See Appendix B for a more detailed discussion of this.

While the query objective proposed in (2) is backed both by an appealing intuition and a theoretical argument (see Section 6), it can be expensive to evaluate due to the cost of accurate entropy estimation. To address this difficulty, we also investigate an alternative involving the Wasserstein distance. The objective we consider is

$$\arg\max_a \mathbb{E}_{\tilde{m}_t \sim \tilde{M}_t(a)}\Big[ W\big( P(r \mid H_t = h_t),\; P(r \mid H_t = h_t, A_t = a, \tilde{M}_t(a) = \tilde{m}_t) \big) \Big]. \quad (3)$$

Since the computation of the Wasserstein distance between empirical distributions can be achieved by solving a linear program (Bonneel et al. 2011), Equation (3) naturally lends itself to being approximated via samples. In our implementation, we use POT (Flamary and Courty 2017) to approximate this distance.

Figure 2: Diagram depicting the current belief (Blue) and 2 different hallucinated beliefs (Red). We are assuming a discrete distribution over α-ranks, where the belief is uniform across the relevant circles.

The Wasserstein distance is built on the notion of cost, which allows a practitioner the opportunity to supply additional prior knowledge. In our case, since α-ranks are probability distributions, a natural way to measure accuracy is to use the total variation distance, which corresponds to setting the cost to $c(x, y) = \|x - y\|_1$. On the other hand, in cases where we are interested in finding the relative ordering of agents under the α-rank, an alternative cost such as the Kendall tau metric (Fagin et al. 2006) could be used. While we emphasize the ability of the Wasserstein divergence to work with any cost, we leave the empirical study of non-standard costs for future work.

It is important to note that the objective in (3) is qualitatively different from the information gain objective proposed in (2). Figure 2 provides a diagram illustrating a major difference between the two objectives. The entropy of both belief distributions shown in red is the same. In contrast, the Wasserstein distance in (3) between the current belief in blue and the hallucinated belief in red is much smaller for the distribution on the left compared to the distribution on the right.
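As an illustration of how (3) can be approximated from samples, the sketch below computes the Wasserstein distance between two empirical α-rank distributions with the paper's cost c(x, y) = ‖x − y‖₁, using POT's exact EMD solver (`ot.emd2`); the wrapper itself is our own.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def wasserstein_tv(ranks_p, ranks_q):
    """Wasserstein distance between two empirical alpha-rank distributions
    with the cost c(x, y) = ||x - y||_1 from the text, solved as a linear
    program by POT's exact EMD solver."""
    ranks_p, ranks_q = np.asarray(ranks_p), np.asarray(ranks_q)
    # Uniform weights over the empirical atoms.
    a = np.full(len(ranks_p), 1.0 / len(ranks_p))
    b = np.full(len(ranks_q), 1.0 / len(ranks_q))
    # Pairwise L1 costs between sampled alpha-ranks.
    cost = np.abs(ranks_p[:, None, :] - ranks_q[None, :, :]).sum(-1)
    return ot.emd2(a, b, cost)
```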
Notions of Regret

We quantify the performance of our method by measuring regret. Our main analysis relies on Bayesian regret (Russo and van Roy 2018), defined as

$$J^B_t = 1 - \mathbb{E}_{h_t}\left[P(r = r^\star \mid H_t = h_t)\right], \quad (4)$$

where we use r⋆ to denote the α-rank with the highest probability under r at time t. In (4), the expectation is over realizations of the observation model. Since J^B_t, like all purely Bayesian notions, does not involve the ground-truth payoff, we need to justify its practical relevance. We do this by benchmarking it against two notions of frequentist regret. The first measures how accurate the probability we assign to the ground truth r_GT = f(M⋆) is:

$$J^F_t = 1 - \mathbb{E}_{h_t}\left[P(r = r_{GT} \mid H_t = h_t)\right]. \quad (5)$$

The second measures whether the mean of our payoff belief, which we denote M_µ, evaluates to the correct α-rank:

$$J^M_t = 1 - \mathbb{E}_{h_t}\left[\delta\left[f(M_\mu) = r_{GT}\right]\right], \quad (6)$$

where the symbol δ[predicate] evaluates to 1 or 0 depending on whether the predicate is true or false. In Section 7, we empirically conclude that these three notions of regret are closely coupled in practice, changing at a comparable rate.
Regret Bounds

As an intermediate step before discussing information gain on the α-ranks, we first analyze the behavior of a query selection rule which maximizes information gain over the payoffs:

$$\pi_{IGM}(a \mid H_t = h_t) = \arg\max_a I\big(\tilde{M}_t; (\tilde{M}_t(a), a) \mid H_t = h_t\big). \quad (7)$$

The following result shows that using sampling strategy π_IGM for T timesteps leads to a decay in regret of at least $T e^{O(-\sqrt{\Delta T})}$, proving it will incur no regret as T → ∞.

Proposition 1 (Regret Bound for Information Gain on Payoffs). If we select actions using strategy π_IGM, the regret at timestep T is bounded as

$$J^B_T \le J^F_T = 1 - \mathbb{E}_{h_T}\left[P(r = r_{GT} \mid H_T = h_T)\right] \le T e^{g(T)}, \quad (8)$$

where $g(T) = O(-\sqrt{\Delta T})$.

The proof, and an explicit form of g, are found in the supplementary material. We now proceed to our second result, where we maximize information gain on the α-ranks directly. Consider a querying strategy that is an extension of (1) to T-step look-ahead, defined as

$$\pi_{IGR} = \arg\max_{a_1, \ldots, a_T} I\big(r; (\tilde{M}_1(a_1), a_1), \ldots, (\tilde{M}_T(a_T), a_T)\big). \quad (9)$$

We quantify the regret achieved by π_IGR in the proposition below.
Proposition 2 (Regret Bound for Information Gain on Belief over α-Ranks). If we select actions using strategy π_IGR, the regret is bounded as $J^B_T = 1 - P(r = r^\star \mid H_T = h_T) \to 0$ as $T \to \infty$.

Proposition 2 provides a theoretical justification for querying the strategies that maximize information gain on the α-ranks. A more explicit regret bound (similar to Proposition 1) and the proof are provided in Appendix E. In practice, to avoid the combinatorial expense of selecting action sequences using π_IGR, we use the greedy query selection strategy in Equation (1). While the regret result above does not carry over, this idealized setting at least provides some justification for information gain as a query selection criterion.
Experiments

In this section, we describe our results on synthetic games, graphing the Bayesian regret J^B_t described in Section 6. We also justify the use of Bayesian regret, showing that it is highly coupled with the ground-truth payoff. We benchmark two versions of our algorithm, αIG (Bins) and αIG (NSB), which differ in the employed entropy estimator. We compare to three baselines: RG-UCB, a frequentist bandit algorithm (Rowland et al. 2019); Payoff, which maximizes the information gain about the payoff distribution; and Uniform, which selects payoffs uniformly at random. RG-UCB represents the current SOTA in this domain, Payoff represents the performance of a Bayesian method that does not take into account the structure of the mapping between payoffs and α-ranks, and Uniform provides a point of reference as the simplest/most naive method. A detailed explanation of the experimental setup and details of the used hyperparameters are included in Appendix C.

Figure 3: Payoff matrices for 2 Good, 2 Bad (left) and 3 Good, 5 Bad (right). Best viewed in color.

Good-Bad Games
To investigate our algorithm, we study two environments whose payoffs are shown in Figure 3. We start with the relatively simple environment with 4 agents. Figure 3 (left) shows the expected payoffs, which we can interpret as win-rates. Samples are drawn from a Bernoulli distribution with the appropriate mean. We refer to the environment as '2 Good, 2 Bad' since agents 1 and 2 are much stronger than the other 2 agents, winning the vast majority of their games against them. Since the ordering between agents 3 and 4 has no effect on the α-rank, gathering samples to determine this ordering (highlighted in Purple) does not affect the belief distribution over α-ranks. Furthermore, since we treat this as a 1-population game, the entries highlighted in Green, where each agent plays against themselves, do not affect the α-rank. Entries that are necessary to determine the ordering between agents 1 and 2 are the most relevant for the α-rank and are highlighted in Red. Since agent 2 is slightly better than agent 1, the true α-rank is (0, 1, 0, 0). However, it can be difficult to determine the correct ordering between agents 1 and 2 without drawing many samples from these entries. The game thus provides a model for the common scenario of agents with clustered performance ratings.
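For intuition, here is a hypothetical instance of such a payoff matrix; the exact values used in the experiments are those of Figure 3 (not reproduced here), so the numbers below are purely illustrative. Running it through the `alpharank_single` sketch from the Background section recovers the mass concentrating on agent 2.

```python
import numpy as np

# Hypothetical '2 Good, 2 Bad' win-rate matrix (illustrative values only;
# the experiments use the payoffs of Figure 3). Agent 2 is slightly
# stronger than agent 1, and both dominate agents 3 and 4.
M = np.array([
    [0.50, 0.49, 0.90, 0.90],   # agent 1
    [0.51, 0.50, 0.90, 0.90],   # agent 2
    [0.10, 0.10, 0.50, 0.50],   # agent 3
    [0.10, 0.10, 0.50, 0.50],   # agent 4
])
# Reusing alpharank_single from the earlier sketch:
print(alpharank_single(M, eps=1e-5))  # approximately (0, 1, 0, 0)
```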
Focusing on Relevant Payoffs

Figure 4 presents the behavior of our method and RG-UCB on this task. As expected, RG-UCB splits its sampling between the Red entries and the Purple entries, whereas our method concentrates its sampling much more significantly on the relevant entries, determining the ordering between agents 1 and 2. This is because, in contrast to our method, RG-UCB aims to correctly determine the ordering between all entries used in the calculation of α-rank, irrespective of whether they matter for the final outcome.

Wasserstein Payoff Selection Does Well
Comparing the Wasserstein criterion with Information Gain payoff selection, we can see that it enjoys better concentration of the sampling on the Red entries, and improved performance towards the end of training. Appendix D provides a more detailed analysis of this.

We do not include Uniform on the regret graphs, since there is no reasonable value we could compute for it.

Code is available at github.com/microsoft/InfoGainalpharank.
Bayesian and Frequentist Regret Go Down
Figure 5 shows the resulting performance of the methods on this task, measured by the regret. Due to the relative simplicity of the game, there is limited benefit to our method over RG-UCB, but there is a clear benefit over more naive methods that systematically or uniformly sample the entries. We can see that the Bayesian regret J^B_t and the frequentist regrets J^F_t and J^M_t are highly correlated, providing empirical justification for minimizing J^B_t and validating that our method is concentrating on the ground truth.

Comparing Entropy Estimators
We also investigate a larger-scale version of 2 Good, 2 Bad with 3 good and 5 bad agents. Figure 6 shows the results, demonstrating a clear benefit for our method using the Binning estimator for the Information Gain or the Wasserstein objective. The performance of the NSB entropy estimator is not surprising given the significantly larger size of this task compared to '2 Good, 2 Bad'. A necessary input to the NSB estimator is an upper bound on the total number of atoms in the distributions, for which we only have a crude approximation that grows exponentially with the size of the payoff matrix. Figure 7 shows the proportion of entries sampled for αIG (Bins), the Wasserstein objective, and RG-UCB. Once again, RG-UCB spends a significant part of its sampling budget determining the ordering between agents that do not have an effect on the α-rank of the game (in this task agents 3 to 8). In contrast, our methods concentrate their sampling on the Red entries that determine the payoffs between the top 3 agents, and hence the true α-rank. In general, our algorithm does not depend as much on accurate estimates of entropy as on identifying the distribution with the lowest entropy, for which the NSB estimator isn't tuned.

Incorporating Prior Knowledge
A large benefit of our Bayesian approach is the ability to incorporate prior knowledge and modelling assumptions into our model in a principled manner. To demonstrate the benefits, we incorporate the following prior knowledge into both our algorithm and RG-UCB: 1) M(σ, τ) + M(τ, σ) = 1; 2) entries in their respective blocks are equal to each other (except for the top-left block). A detailed description of the setup is included in Appendix C. Figure 8 compares the performance of αIG, αWass, and RG-UCB on 3 Good, 5 Bad when utilizing this prior knowledge. We can see that our approach significantly outperforms RG-UCB on this task, further demonstrating the importance of our direct information gain objective. The results also show significantly improved sample efficiency over the results in Figure 6, demonstrating that αIG and αWass are able to efficiently take advantage of the prior knowledge supplied.
Conclusion

We described αIG, an algorithm for estimating the α-rank of a game using a small number of payoff evaluations. αIG works by maximizing information gain. It achieves competitive sample efficiency and allows a way of building in prior knowledge about the payoffs.

Figure 4: Proportion of entries sampled on 2 Good, 2 Bad for different methods and objectives.

Figure 5: Results for 2 Good, 2 Bad. Graphs show the mean and standard error of the mean over multiple runs (shown in brackets) of 10 repeats each.

Figure 6: Results for 3 Good, 5 Bad. Graphs show the mean and standard error of the mean over multiple runs (shown in brackets) of 10 repeats each.

Figure 7: Proportion of entries sampled on 3 Good, 5 Bad.

Figure 8: Results on 3 Good, 5 Bad when incorporating prior knowledge into the models.

Acknowledgements
We thank the Game Intelligence group at Microsoft Research Cambridge for their useful feedback, support, and help with setting up computing infrastructure. Tabish Rashid is supported by EPSRC grants (EP/M508111/1, EP/N509711/1).
References
Abbasi-yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems 24, 2312–2320. Curran Associates, Inc. URL http://papers.nips.cc/paper/4417-improved-algorithms-for-linear-stochastic-bandits.pdf.

Balduzzi, D.; Tuyls, K.; Perolat, J.; and Graepel, T. 2018. Re-evaluating evaluation. In Advances in Neural Information Processing Systems, 3268–3279.

Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. 2019. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.

Bonneel, N.; Van De Panne, M.; Paris, S.; and Heidrich, W. 2011. Displacement interpolation using Lagrangian mass transport. In Proceedings of the 2011 SIGGRAPH Asia Conference, 1–12.

Chen, S.; and Joachims, T. 2016. Modeling intransitivity in matchup and comparison data. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 227–236.

Daskalakis, C.; Goldberg, P. W.; and Papadimitriou, C. H. 2009. The complexity of computing a Nash equilibrium. SIAM Journal on Computing.

Elo, A. E. 1978. The Rating of Chessplayers, Past and Present. Arco Pub.

Fagin, R.; Kumar, R.; Mahdian, M.; Sivakumar, D.; and Vee, E. 2006. Comparing partial rankings. SIAM Journal on Discrete Mathematics.

Flamary, R.; and Courty, N. 2017. POT: Python Optimal Transport library.

Fudenberg, D.; and Tirole, J. 1991. Game Theory. MIT Press, Cambridge, Massachusetts.

Herbrich, R.; Minka, T.; and Graepel, T. 2007. TrueSkill: A Bayesian skill rating system. In Advances in Neural Information Processing Systems, 569–576.

Jordan, P. R.; Vorobeychik, Y.; and Wellman, M. P. 2008. Searching for approximate equilibria in empirical games. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, 1063–1070.

Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; and Graepel, T. 2017. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, 4190–4203.

Minka, T.; Cleven, R.; and Zaykov, Y. 2018. TrueSkill 2: An improved Bayesian skill rating system. Tech. Rep.

Muller, P.; Omidshafiei, S.; Rowland, M.; Tuyls, K.; Perolat, J.; Liu, S.; Hennes, D.; Marris, L.; Lanctot, M.; Hughes, E.; Wang, Z.; Lever, G.; Heess, N.; Graepel, T.; and Munos, R. 2020. A Generalized Training Approach for Multiagent Learning. In International Conference on Learning Representations. URL https://openreview.net/forum?id=Bkl5kxrKDr.

Nash, J. 1951. Non-cooperative games. Annals of Mathematics.

Nemenman, I.; Shafee, F.; and Bialek, W. 2002. Entropy and inference, revisited. In Advances in Neural Information Processing Systems, 471–478.

Omidshafiei, S.; Papadimitriou, C.; Piliouras, G.; Tuyls, K.; Rowland, M.; Lespiau, J.-B.; Czarnecki, W. M.; Lanctot, M.; Perolat, J.; and Munos, R. 2019. α-rank: Multi-agent evaluation by evolution. Scientific Reports.

Rowland, M.; Omidshafiei, S.; Tuyls, K.; Perolat, J.; Valko, M.; Piliouras, G.; and Munos, R. 2019. Multiagent Evaluation under Incomplete Information. In Advances in Neural Information Processing Systems, 12270–12282.

Russo, D.; and van Roy, B. 2018. Learning to optimize via information-directed sampling. Operations Research.

Salakhutdinov, R.; and Mnih, A. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, 880–887.

Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Simone, M. n.d. ndd - Bayesian entropy estimation from discrete data. URL https://github.com/simomarsili/ndd.

Srinivas, N.; Krause, A.; Kakade, S. M.; and Seeger, M. 2009. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.

Su, X.; and Khoshgoftaar, T. M. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence.

Tuyls, K.; Perolat, J.; Lanctot, M.; Hughes, E.; Everett, R.; Leibo, J. Z.; Szepesvári, C.; and Graepel, T. 2020. Bounds and dynamics for empirical game theoretic analysis. Autonomous Agents and Multi-Agent Systems.

Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.

Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature.

Walsh, W. E.; Parkes, D. C.; and Das, R. 2003. Choosing samples to compute heuristic-strategy Nash equilibrium. In International Workshop on Agent-Mediated Electronic Commerce, 109–123. Springer.

Wellman, M. P. 2006. Methods for empirical game-theoretic analysis. In AAAI, 1552–1556.
Yang, Y.; Tutunov, R.; Sakulwongtana, P.; and Ammar, H. B. 2019. αα-Rank: Practically Scaling α-Rank through Stochastic Optimisation. arXiv preprint arXiv:1909.11628.

A Additional α-Rank Background

In this section we include a worked example of how to calculate the α-rank in the single-population setting in the infinite-α regime. The purpose of this example is to help build intuition about α-rank in our particular setting. We will use the 2 Good, 2 Bad matrix game, shown in Figure 9, as an example, where the payoffs represent the expected win rates between agents.

Figure 9: Payoff matrices for 2 Good, 2 Bad with the strategies for each player labeled.

Note that in this example, we are considering a player playing the game against an identical opponent. The strategy space is S = {G_1, G_2, B_1, B_2}, with G_1 and G_2 representing the good agents, and B_1 and B_2 the bad agents. The good agents always win against the bad agents (hence M(G_1, B_1) = 1 and M(B_1, G_1) = 0, to use G_1 and B_1 as examples). The α-rank is then a probability vector r ∈ Δ³ ⊂ ℝ⁴.

In order to compute the α-rank, we must first construct the Markov chain whose nodes are elements of S. Figure 10 is a diagram representing this Markov chain.

Figure 10: Markov chain constructed to compute the α-rank. The transition probabilities between the nodes are represented using the width of the arrows.

The 4 nodes are the elements of S = {G_1, G_2, B_1, B_2}, and the transition probabilities are as defined in Section 2. The width of the edges between the nodes in Figure 10 represents the magnitude of their transition probabilities. Since the good agents consistently beat the bad agents, there is a large transition probability from nodes B_1 and B_2 to nodes G_1 and G_2. Likewise, there is a very small transition probability from nodes G_1, G_2 to B_1, B_2. Since strategies B_1 and B_2 are equally matched, there is an equal probability of transitioning from B_1 to B_2 as there is of transitioning from B_2 to B_1. Crucially though, that transition probability is significantly smaller than the transition probabilities from B_1 and B_2 to nodes G_1 and G_2. All nodes' self-transition probabilities ensure that there is a valid transition matrix for the Markov chain. Importantly, B_1 and B_2 have small self-transition probabilities whereas G_2 has a very large self-transition probability.

We know (|S| − 1)^{−1} = 1/3, and letting ε > 0 be small, we can compute the exact transition matrix:

          G_1         G_2         B_1         B_2
  G_1   (2−ε)/3     (1−ε)/3       ε/3         ε/3
  G_2     ε/3         1−ε         ε/3         ε/3
  B_1   (1−ε)/3     (1−ε)/3    (1+4ε)/6       1/6
  B_2   (1−ε)/3     (1−ε)/3       1/6      (1+4ε)/6

Intuitively, we can see that the transition probabilities are all leading to node G_2. So we would expect to spend a large proportion of time at node G_2 if we were traversing the graph according to the transition probabilities. Hence, G_2 is a stronger strategy than the others, and so we would expect the α-rank to reflect that. This is formalised in the computation of α-rank by considering the unique stationary distribution of the Markov chain. The α-rank is then (0, 1, 0, 0) over (G_1, G_2, B_1, B_2) as ε → 0.
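The following snippet numerically checks the transition matrix above: rows sum to one, and the stationary distribution approaches (0, 1, 0, 0) as ε → 0 (the helper name is ours).

```python
import numpy as np

def two_good_two_bad_chain(eps):
    """Exact transition matrix from the worked example, rows and columns
    ordered (G1, G2, B1, B2)."""
    return np.array([
        [(2 - eps) / 3, (1 - eps) / 3,           eps / 3,           eps / 3],
        [      eps / 3,       1 - eps,           eps / 3,           eps / 3],
        [(1 - eps) / 3, (1 - eps) / 3, (1 + 4 * eps) / 6,           1 / 6],
        [(1 - eps) / 3, (1 - eps) / 3,           1 / 6, (1 + 4 * eps) / 6],
    ])

for eps in (1e-2, 1e-4, 1e-6):
    C = two_good_two_bad_chain(eps)
    assert np.allclose(C.sum(axis=1), 1.0)   # rows are valid distributions
    vals, vecs = np.linalg.eig(C.T)
    r = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1.0))]))
    print(eps, r / r.sum())                  # tends to (0, 1, 0, 0)
```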
B Implementation Details

αIG (Bins). For this binning entropy estimator we split [0, 1] into 101 equal bins of width 0.01 (implemented by rounding to the nearest second decimal place). We then estimated the entropy using a histogram.

αIG (NSB).
The NSB estimator requires an upper bound on the total number of atoms, but since we do not know the true upper bound, we utilize an estimate of the total number of possible α-ranks, which we describe below. We use the open-source implementation provided in (Simone).

Upper bound on the number of α-ranks. In the infinite-α regime there are a finite number of possible α-ranks. This is because only the ordering between relevant entries in the payoff matrix changes the transition matrix of the Markov chain produced in the computation of α-rank (Rowland et al. 2019). Let there be k populations, each with S strategies. Then there are S^k joint strategies considered, and so the transition matrix of the Markov chain has S^k rows, one for each of the possible joint strategies. Each possible joint strategy σ can transition to at most k(S − 1) other strategies τ ≠ σ. The probability of a self-transition is uniquely determined based on these probabilities. This gives at most 2^{k(S−1)} unique values for that row. There are then [2^{k(S−1)}]^{S^k} = 2^{S^k k(S−1)} unique transition matrices. Thus, the possible number of unique α-ranks is upper-bounded by 2^{S^k k(S−1)}. This bound is not tight, since there are many transition matrices with equal stationary distributions. In our experiments with K = 1 this gives 2^{S(S−1)}.
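As a sanity check of the bound (under our reconstruction of the formula above), the count for the single-population 4-agent games used in the experiments works out as follows:

```python
def alpharank_count_bound(S, k):
    """Loose upper bound 2**(S**k * k * (S - 1)) on the number of distinct
    infinite-alpha alpha-ranks for k populations with S strategies each."""
    return 2 ** (S ** k * k * (S - 1))

print(alpharank_count_bound(4, 1))  # 2**12 = 4096 for the 4x4 games
```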
Conditioning of the belief distribution. In our experiments we found that setting N_c = 1, as suggested by theory, is not always sufficient, and we use N_c = 100 for all experiments.

Figure 11: Comparing the values of the objectives for each entry after sampling 5 values for every entry. Top shows the results for N_c = 1. Bottom shows the results for N_c = 100. Mean and standard deviation are plotted across 10 seeds; the maximum entry is highlighted in black. The mean and standard deviation of the (estimated) entropy of the current belief distribution are also plotted as a dashed horizontal line.

After drawing a sample m′_t ∼ M_t + ε, we condition our belief distribution over α-rank on this sample N_c times and then approximate the entropy of the resulting hallucinated belief distribution (or the Wasserstein distance between the current belief and the hallucinated belief). Theory suggests that setting N_c = 1 is sufficient; however, empirically we found that this did not produce satisfactory results. Figure 11 shows that only conditioning once produces very little separation between the values for the different entries. Additionally, we can see that there is very little separation between the current belief's entropy and the hallucinated belief's entropy. In contrast, we can see that conditioning N_c = 100 times produces significantly more separation. Figure 12 shows the same trend after additionally sampling 250 values for the Red entries. The Wasserstein objective shows the same trend: conditioning more than once produces significantly more separation. A Wasserstein distance of 0 indicates that the two distributions are identical.

Figure 12: Comparing the values of the objectives for each entry after sampling 5 values for every entry, and then additionally sampling 250 values for the Red entries. Top shows the results for N_c = 1. Bottom shows the results for N_c = 100. Mean and standard deviation are plotted across 10 seeds; the maximum entry is highlighted in black. The mean and standard deviation of the (estimated) entropy of the current belief distribution are also plotted as a dashed horizontal line.

C Experimental Setup

α-Rank. In the computation of α-rank we set ε to a small negative power of 10 in all of our experiments.
Baselines

ResponseGraphUCB uses a Hoeffding bound to construct the confidence interval

$$\left(\mu - \sqrt{\tfrac{\log(2/\delta)\,(b-a)^2}{2N}},\;\; \mu + \sqrt{\tfrac{\log(2/\delta)\,(b-a)^2}{2N}}\right),$$

where δ is the confidence hyperparameter of the algorithm, b is the maximum value an entry can take, a is the minimum value, and N is the number of times a value has been seen for an entry. For all experiments we swept over seven values of δ, and the final value is selected by considering the area under the curve for 1 − P(f(M̄) = r_GT).

Uniform. The entry to sample is picked uniformly from all possible entries.
Payoff. The entry which maximises the information gain between its sample and the payoff distribution is chosen. For an isotropic Gaussian this is equivalent to picking the entry with the lowest count, which results in systematic sampling of each entry. For a non-isotropic Gaussian the same procedure as in (Srinivas et al. 2009) is used.
Graphs

1 − P(f(M̄) = r_GT). At each timestep we compute the α-rank of the mean payoff matrix. Equality is determined if |f(M̄) − r_GT| < 0.01. The choice of 0.01 is largely arbitrary; we did not find the results to be sensitive to this.

1 − P(r_GT). 100 times during training (evenly spaced), we draw samples from the current belief distribution over α-ranks. P(r_GT) is determined from these samples (which are aggregated by rounding each value to the nearest second decimal place) by counting the number of sampled α-ranks r such that |r − r_GT| < 0.01.

1 − P(r*) is determined similarly to 1 − P(r_GT), except we use the samples to calculate the mode.

For ResponseGraphUCB, we construct a distribution over the payoff entries as being uniform over the confidence intervals.
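A sketch of how the sample-based quantities above can be computed; the tolerance value 0.01 mirrors the rounding used by the binning estimator and is our assumption where the original digits were lost.

```python
import numpy as np

def prob_rank(rank_samples, target, tol=0.01, decimals=2):
    """Monte Carlo estimate of P(target): the fraction of alpha-ranks
    sampled from the belief that match `target` entrywise within `tol`,
    after rounding to `decimals` places."""
    hits = [np.all(np.abs(np.round(r, decimals) - target) < tol)
            for r in rank_samples]
    return float(np.mean(hits))

# 1 - P(r_GT) uses the ground-truth rank as target; 1 - P(r*) uses the
# mode of the rounded samples instead.
```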
Environments

2 Good, 2 Bad. Observations are sampled from Ber(x), where x is the value in the payoff matrix.

αIG(Bins), αIG(NSB), αWass: The prior used is N(µ, σ²), with aleatoric noise σ_A and a constant prior mean µ. We swept over two candidate values of σ and three candidate values of σ_A. 20 samples are used to approximate the expectation, N_e = 20; 1000 samples are drawn from the belief distribution(s) to approximate the quantities inside the expectation, N_b = 1000; and we set N_r = 10. For all 3 methods the selected prior scale was σ = 1; the selected σ_A differed between αIG (Bins)/αIG (NSB) and αWass.

ResponseGraphUCB: We set δ to the value selected by the sweep described above. The maximum value is 1, the minimum value is 0.

3 Good, 5 Bad. Observations are sampled from Ber(x), where x is the value in the payoff matrix.

αIG(Bins), αIG(NSB), αWass: The prior used is N(µ, σ²), with aleatoric noise σ_A and a constant prior mean µ. We swept over two candidate values of σ and three candidate values of σ_A. N_e = 10, N_b = 500, N_r = 500. For all 3 methods the selected prior scale was σ = 1; the selected σ_A differed between αIG (Bins)/αIG (NSB) and αWass.

ResponseGraphUCB: We set δ to the selected value.

Gaussian Games. To match the games considered in our theoretical analysis, observations are sampled from a Gaussian centred at x and then clipped to lie within a fixed distance of x, where x is the value of the entry in the payoff matrix. The values of x are uniformly drawn from [0, 1].

αIG(Bins), αIG(NSB), αWass: The prior used is N(µ, σ²), with aleatoric noise σ_A. We swept over two candidate values of σ and two candidate values of σ_A. N_e = 10, N_b = 500, N_r = 100. For αIG(Bins) we set σ = 1 and σ_A = 1; for αWass we set σ = 1 with a smaller σ_A; for αIG(NSB) both σ and σ_A were set to values below 1.

ResponseGraphUCB: We set δ to the selected value. The maximum value is 2, the minimum value is -1.

Incorporating Prior Knowledge.

αIG(Bins), αWass: The aleatoric noise σ_A is set to a value selected from a sweep over three candidates, and the mean is µ = 0.5. N_e = 10, N_b = 500, N_r = 100.

The kernel K we use for the GP is specified as follows. In order to encode the prior knowledge that elements within a block (except the top-left block) are equal, we partition the payoff matrix. A strategy σ = (x, y) is a member of block b_1 if 1 ≤ x ≤ 3 and 1 ≤ y ≤ 3, i.e. if it is a payoff between a Good agent and another Good agent; of block b_2 if 1 ≤ x ≤ 3 and y > 3, a payoff between a Good agent and a Bad agent; of block b_3 if 1 ≤ y ≤ 3 and x > 3, a payoff between a Bad agent and a Good agent; and of block b_4 otherwise, if it is a payoff between a Bad agent and a Bad agent.

The kernel k encoding block-wise equality is then defined as: k(b_i, b_j) = 1 for all i ≠ j; k(b_i, b_i) = 0 for i ∈ {2, 3, 4}; and k(b_1, b_1) = 1.

To additionally encode anti-symmetry (about the mean µ = 0.5), we then produce a new kernel k′ from k as follows. For a strategy σ = (x, y), define σᵗ := (y, x), the transpose of σ. We wish to encode that M(σ) = 1 − M(σᵗ). Then

k′(σ, τ) = k(σ, τ) + k(σᵗ, τᵗ) − k(σ, τᵗ) − k(σᵗ, τ).

The finished kernel K is then defined as K := (k′)ᵀ(k′) to ensure it is positive definite. The entries are then divided by 500 to ensure a suitable magnitude for the variance.

ResponseGraphUCB: We set δ to the selected value. In order to incorporate the same modelling assumptions into RG-UCB, for every real sample we receive from the payoff matrix, we pretend to receive an appropriate sample for the relevant entries. After receiving a payoff p ∈ {0, 1} for strategy σ, we then pretend to receive: 1 − p for σᵗ, to encode anti-symmetry; and p for all τ in the same block as σ (except the top-left block), where blocks are defined the same as for the kernel k specified above.
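A sketch of the kernel construction above for the 8-agent game (strategy profiles are pairs (x, y) with agents 1-3 Good); the variable names and enumeration order are ours.

```python
import numpy as np

S = 8  # 3 Good + 5 Bad agents; strategies are 1-indexed pairs (x, y)
strategies = [(x, y) for x in range(1, S + 1) for y in range(1, S + 1)]

def block(s):
    x, y = s
    if x <= 3 and y <= 3: return 1   # Good vs Good (top-left block)
    if x <= 3 and y > 3:  return 2   # Good vs Bad
    if y <= 3 and x > 3:  return 3   # Bad vs Good
    return 4                         # Bad vs Bad

def k(s, t):
    """Block-wise kernel from the text: 1 across blocks, 0 within blocks
    2-4 (entries there are equal), 1 within the exempt top-left block."""
    bi, bj = block(s), block(t)
    if bi != bj:
        return 1.0
    return 1.0 if bi == 1 else 0.0

def k_prime(s, t):
    """Anti-symmetrized kernel k'(s,t) = k(s,t) + k(sT,tT) - k(s,tT) - k(sT,t)."""
    st, tt = (s[1], s[0]), (t[1], t[0])
    return k(s, t) + k(st, tt) - k(s, tt) - k(st, t)

Kp = np.array([[k_prime(s, t) for t in strategies] for s in strategies])
K = (Kp.T @ Kp) / 500.0   # PSD by construction, rescaled as in the text
```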
D Further Results

Gaussian Games
Figure 13 shows the results on 4x4 games with Gaussian noise, demonstrating improved performance across all 3 regret metrics for αIG (Bins). This is empirical confirmation of our theoretical results, and shows that our method achieves better performance compared to RG-UCB on general games.
Figure 14 shows the values used by the different objectives during training. The top row shows the values after sampling 5 values for each entry, showing a clear separation between the Red entries and the rest. The bottom row shows the values after additionally sampling 250 values for the Red entries. We can then see a large difference between the Wasserstein- and entropy-based objectives. As desired, the Wasserstein-based objective shows a large separation between the Red entries and the others, additionally assigning the smallest values to the irrelevant Green and Purple entries.

Figure 13: Results for the 4x4 Gaussian Game. Graphs show the mean and standard error of the mean over multiple runs (shown in brackets) of 10 randomly sampled games.

Figure 14: Value of the objectives for each entry after sampling 5 values for every entry (top) and additionally sampling 250 values for the Red entries (bottom). Mean and standard deviation are plotted across 10 seeds; the maximum entry is highlighted in black.

Figure 15: Proportion of entries sampled on 2 Good, 2 Bad for more seeds.

Figure 16: Values of the objectives for each entry on '3 Good, 5 Bad' after sampling 5 values for every entry. Mean and standard deviation across 10 seeds are shown; the maximum is highlighted in black.

Figure 17: Values of the objectives for each entry on '3 Good, 5 Bad' after sampling 5 values for every entry, and additionally sampling 1000 values for the Red entries. Mean and standard deviation across 10 seeds are shown; the maximum is highlighted in black.

Figure 18: Proportion of entries sampled on 3 Good, 5 Bad for more seeds.
E Proofs
Notation
As a reminder, we reintroduce notation that is relevant to this section. M⋆ is the true payoff vector, which is unknown to us. M is our prior distribution over the entries of the payoff vector, represented by a GP. The GP models noise in the observations of the payoff as M̃ = M + ε, where ε ∼ N(0, Iσ_A²). We model the payoffs we receive from the real game at timestep t when taking action a_t as m_t ∼ M⋆(a_t) + ε′_t, where the ε′_t are i.i.d. and have support on the interval [−σ_A, σ_A].

Clipped Noise.
Note that it is important that the observation noise ε′_t is clipped, since this allows us to apply Lemma 1, which is an existing result from Srinivas et al. (2009, Theorem 6). In that paper, Srinivas et al. (2009) assume the noise terms ε′_t are uniformly bounded by σ_A, which is equivalent to all ε′_t having support on the interval [−σ_A, σ_A]. Since this assumption is shared by the seminal paper (Srinivas et al. 2009), we do not believe it to be overly restrictive in our theoretical analysis.

Regret.
We quantify the performance of our method by measuring regret. Our main analysis relies on Bayesian regret (Russo and van Roy 2018), defined as

$$J^B_t = 1 - \mathbb{E}\left[P(r = r^\star \mid H_t = h_t)\right], \quad (10)$$

where the expectation is taken over the following:
• Our prior distribution, M_t, representing our uncertainty over the true unknown payoff vector M⋆ at timestep t.
• The randomness in the actions we have taken and the corresponding observations we have received up to timestep t. These are encoded by our history H_t, in which a particular realization is h_t = a_1, m_1, ..., a_t, m_t, with a_t ∼ A_t, our distribution over actions to take at timestep t, and m_t ∼ M⋆(a_t) + ε′_t our clipped noise model when interacting with the game.

In this formulation r⋆ is used to denote the α-rank with the highest probability under r at time t, where r is the distribution over α-ranks according to the prior, P(r) = P(M_t ∈ f^{−1}(r)). Since J^B_t, like all purely Bayesian notions, does not involve the ground-truth payoff, we need to justify its practical relevance. We do this by benchmarking it against two notions of frequentist regret. The first measures how accurate the probability we assign to the ground truth r_GT = f(M⋆) is:

$$J^F_t = 1 - \mathbb{E}_{h_t}\left[P(r = r_{GT} \mid H_t = h_t)\right]. \quad (11)$$

The second measures whether the mean of our payoff belief, which we denote M_µ, evaluates to the correct α-rank:

$$J^M_t = 1 - \mathbb{E}_{h_t}\left[\delta\left[f(M_\mu) = r_{GT}\right]\right], \quad (12)$$

where the symbol δ[predicate] evaluates to 1 or 0 depending on whether the predicate is true or false. For both these notions of regret the expectation is taken only over the history h_t ∼ H_t.

Permutation Property.
We begin by explicitly stating a property of the infinite-α version of α-rank. The function f computing the α-rank satisfies the permutation property, defined as

$$\pi(M_1) = \pi(M_2) \implies f(M_1) = f(M_2). \quad (13)$$

Here, π(M) denotes the ordering of the elements of the vector M using the standard ≥ operation on real numbers. This is the same property exploited by the frequentist analysis of Rowland et al. (2019). Letting 𝓡 ⊂ ℝ^{|S|} be the space of all valid α-ranks, Property (13) implies that 𝓡 is a finite set and

$$|\mathcal{R}| \le N!, \quad (14)$$

where N := |S| is the number of pure strategies/actions. Note that our proofs consider the general multi-population case of α-rank, and are not restricted to just the single-population scenario.

Separability Assumption.
Similarly to the work of Rowland et al. (2019), we limit ourselves to payoffs that are distinguishable, in order to make α-rank robust to small changes in the payoffs. We assume that there exists a constant Δ > 0 such that for all payoff indices i ≠ j,

$$|M^\star(i) - M^\star(j)| \ge \Delta. \quad (15)$$

Information Gain and Entropy. We recall a formula for the information gain in terms of the entropy:

$$I\big(r; (\tilde{M}_t(a), a) \mid H_t = h_t\big) = H(r \mid H_t = h_t) - H\big(r \mid H_t = h_t, A_t = a, \tilde{M}_t(a)\big) \quad (16)$$
$$= H(r \mid H_t = h_t) - \mathbb{E}_{\tilde{m}_t \sim \tilde{M}_t(a)}\Big[H\big(r \mid H_t = h_t, A_t = a, \tilde{M}_t(a) = \tilde{m}_t\big)\Big]. \quad (17)$$

Regret Bound for a Policy Maximizing Information Gain on Payoffs.
We now show a regret bound for a policy that maximizes information gain on the payoffs. Define

$$\pi_{IGM}(a \mid H_t = h_t) = \arg\max_a I\big(\tilde{M}_t; (\tilde{M}_t(a), a) \mid H_t = h_t\big) \quad (18)$$

as the policy which selects the action that maximises the information gain on the payoffs (given any history H_t and prior M_t). Let H^{IGM}_T denote the history when following π_IGM for T timesteps.

Proposition 1 (Regret Bound for Information Gain on Payoffs). If we select actions using strategy π_IGM, the regret at timestep T is bounded as

$$J^B_T \le J^F_T = 1 - \mathbb{E}_{h_T \sim H^{IGM}_T}\left[P(r = r_{GT} \mid H_T = h_T)\right] \le T e^{g(T)}, \quad \text{where } g(T) = O(-\sqrt{\Delta T}). \quad (19)$$

Proof.
Proposition 1 [Regret Bound For Information Gain on Payoffs]. If we select actions using strategy $\pi_{IGM}$, the regret at timestep $T$ is bounded as

$J^B_T \leq J^F_T \leq 1 - \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ P(r = r_{GT} \mid H_T = h_T) \right] \leq T e^{g(T)}$, where $g(T) = O(-\sqrt{\Delta^2 T})$. (19)

Proof.

We know that $P(r = r^\star \mid H_T = h_T) \geq P(r = r_{GT} \mid H_T = h_T)$, since $r^\star$ is defined as the α-rank with the highest probability under $r$ at time $t$. Thus, $J^B_T \leq J^F_T$ for any history.

Fix a history $h_T$. By the assumption of separability, we have

$P(r = r_{GT} \mid H_T = h_T) \geq P\left( |M_T - M^\star|_\infty \leq \frac{\Delta}{2} \right)$. (20)

We now use concentration results for Gaussian Processes. Specifically, we invoke Corollary 1, stated later, together with an explicit formula for $g(T)$. This proves $J^F_T \leq T e^{g(T)}$, ending our proof.

Regret Bound For Policy Maximizing Information Gain on α-Ranks

We move on to show a bound for a policy that maximizes information gain on the α-ranks. Define

$\pi_{IGR} = \arg\max_{a_1, \ldots, a_T} I(r; (\tilde{M}_1(a_1), a_1), \ldots, (\tilde{M}_T(a_T), a_T))$ (21)

as the policy which maximizes information gain on the α-ranks directly. Note that this is an extension of (1) to $T$-step look-ahead. Let $H^{IGR}_T$ denote the history when following $\pi_{IGR}$ up to timestep $T$.
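Since the $T$-step look-ahead in (21) is generally intractable, a natural surrogate (an illustrative simplification on our part, not a claim from the analysis) is the one-step greedy policy, reusing the hypothetical `info_gain` estimator sketched earlier:

```python
def pi_igr_greedy(h, actions, rank_dist, sample_payoff):
    # Greedily pick the action with the largest estimated one-step
    # information gain on the alpha-rank belief.
    return max(actions, key=lambda a: info_gain(h, a, rank_dist, sample_payoff))
```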
Denote by $h_b(p) = -(p \log p + (1 - p) \log(1 - p))$ the entropy of a Bernoulli random variable with parameter $p$, and denote by $h_b^{-1}$ the inverse of the restriction of $h_b$ to the interval $[1/2, 1]$.

Proposition 2 Expanded [Regret Bound For Information Gain on Belief over α-Ranks]. If we select actions using strategy $\pi_{IGR}$, the Bayesian regret is bounded as

$J^B_T = 1 - \mathbb{E}_{h_T \sim H^{IGR}_T}\left[ P(r = r^\star \mid H_T = h_T) \right] \leq 1 - \delta\left[ T e^{g(T)} \leq (N \log N)^{-1} \right] \, h_b^{-1}\!\left( \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(1 - T e^{g(T)}) + T e^{g(T)} N \log N \right] \right)$, (22)

where $g(T)$ is as in Proposition 1.

Proof.

We start by bounding the entropy of the α-rank distribution. For brevity, let $p^\star = P(r = r^\star \mid H_T = h_T)$, where $h_T$ is drawn from $H^{IGM}_T$. We have

$H(r \mid H^{IGR}_T) \stackrel{(a)}{\leq} H(r \mid H^{IGM}_T) \stackrel{(b)}{\leq} \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(p^\star) + (1 - p^\star) \log |R| \right] \stackrel{(c)}{\leq} \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(p^\star) + (1 - p^\star) N \log N \right]$.

Here, (a) follows from the definition of $\pi_{IGR}$ and equation (17), (b) follows by Lemma 3, and (c) holds because $|R| \leq N!$ by Equation (14) and $\log N! \leq N \log N$. Combining the above with the bound $1 - p^\star \leq T e^{g(T)}$ from Proposition 1, we have

$H(r \mid H^{IGR}_T) \leq \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(p^\star) + (1 - p^\star) N \log N \right]$ (23)
$\leq \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(p^\star) + T e^{g(T)} N \log N \right]$. (24)

Let us now assume that $T e^{g(T)} \leq 1/2$, since we are interested in the behaviour of our regret bound as $T \to \infty$, and we know that as $T \to \infty$, $T e^{g(T)} \to 0$. If $T e^{g(T)} > 1/2$, then we can trivially bound our expression above by 1. Then $p^\star \geq 1 - T e^{g(T)} \geq 1/2$, which implies $h_b(p^\star) \leq h_b(1 - T e^{g(T)})$.

We now proceed to bound the probability of $r^\star$ in terms of the entropy of the α-ranks. We have $h_b(P(r = r^\star \mid H^{IGR}_T = h_T)) \leq H(r \mid H^{IGR}_T)$. This, together with (24) and $h_b(p^\star) \leq h_b(1 - T e^{g(T)})$, implies

$h_b(P(r = r^\star \mid H^{IGR}_T = h_T)) \leq \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(1 - T e^{g(T)}) + T e^{g(T)} N \log N \right]$. (25)

Since the codomain of $h_b$ is $[0, 1]$, we must introduce additional restrictions in order to be able to invert the function. To ensure $h_b(1 - T e^{g(T)}) + T e^{g(T)} N \log N \leq 1$, we restrict our analysis to the case $T e^{g(T)} \leq (N \log N)^{-1}$. Note that this subsumes our earlier restriction of $T e^{g(T)} \leq 1/2$. Again, we can trivially bound our final expression above by 1 should this condition not be met.

We denote by $h_b^{-1}$ the inverse of the restriction of $h_b$ to the interval $[1/2, 1]$. Note that $h_b(x) \leq y \implies x \geq h_b^{-1}(y)$ for $x \in [1/2, 1]$, $y \in [0, 1]$. Therefore,

$h_b(P(r = r^\star \mid H^{IGR}_T = h_T)) \leq \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(1 - T e^{g(T)}) + T e^{g(T)} N \log N \right]$ (26)
$\implies P(r = r^\star \mid H^{IGR}_T = h_T) \geq h_b^{-1}\!\left( \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(1 - T e^{g(T)}) + T e^{g(T)} N \log N \right] \right)$ (27)
$\implies 1 - P(r = r^\star \mid H^{IGR}_T = h_T) \leq 1 - h_b^{-1}\!\left( \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(1 - T e^{g(T)}) + T e^{g(T)} N \log N \right] \right)$. (28)

Finally, we state our final regret bound, incorporating the restrictions we have made:

$1 - P(r = r^\star \mid H^{IGR}_T = h_T) \leq 1 - \delta\left[ T e^{g(T)} \leq (N \log N)^{-1} \right] \, h_b^{-1}\!\left( \mathbb{E}_{h_T \sim H^{IGM}_T}\left[ h_b(1 - T e^{g(T)}) + T e^{g(T)} N \log N \right] \right)$. (29)

This proves the expanded form of the proposition. The simpler form of the Proposition in Section 6 then follows, because as $T \to \infty$, $T e^{g(T)} \to 0$ and both $h_b(1 - T e^{g(T)})$ and $T e^{g(T)} N \log N$ tend to 0, thus ensuring that $h_b^{-1}\!\left( h_b(1 - T e^{g(T)}) + T e^{g(T)} N \log N \right) \to 1$.
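The bound in (22) involves $h_b^{-1}$, which has no closed form; a minimal numeric sketch of our own (by bisection, valid because $h_b$ is strictly decreasing on $[1/2, 1]$) is:

```python
import numpy as np

def h_b(p):
    # Bernoulli entropy; continuous extension h_b(0) = h_b(1) = 0.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)))

def h_b_inv(y, tol=1e-12):
    # Invert the restriction of h_b to [1/2, 1] by bisection.
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h_b(mid) > y:   # entropy still too large, move p towards 1
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As $y \to 0$, `h_b_inv(y)` tends to 1, which is exactly the limiting behaviour used at the end of the proof.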
We use the following result by Srinivas et al. (2009, their Theorem 6), which we specialize to our notation. We use the term Gaussian Process despite the fact that the index set is finite, since the model includes observation noise.

Lemma 1 (Srinivas et al., Concentration for a Gaussian Process). Consider a Gaussian Process $M$ with $N$ indices. Assume $M$ uses a zero-mean prior with constant variance $\sigma^2$ and observation noise $\sigma^2_A$. The posterior process $M_t$ is obtained by conditioning on $t$ observations. The observations are obtained as $m_t = M^\star(a_t) + \epsilon'_t$, where the $\epsilon'_t$ are i.i.d. random variables with support bounded by $[-\sigma_A, \sigma_A]$. Denote the RKHS norm of $M^\star$ under the GP prior by $\|M^\star\|_k$, denote by $\gamma^\star_t$ the maximum information gain about $M$ obtainable in $t$ timesteps, and denote by $\sigma^{\max}_t$ the maximum posterior standard deviation across the $N$ indices after $t$ observations. Then, for any $\Delta > 0$ and for any timestep $t$, we have

$P\left[ |M_t - M^\star|_\infty \leq \frac{\Delta}{2} \right] \geq 1 - t \exp\left( -\sqrt{ \left( \frac{\Delta}{2 \sigma^{\max}_t} \right)^2 - \|M^\star\|^2_k - \gamma^\star_t } \right)$. (30)

The above lemma requires knowledge of the RKHS norm and the maximum obtainable information gain.
Lemma 2 (Worst-Case Constants). For any kernel, we have $\|M^\star\|^2_k \leq \sigma^{-2} \|M^\star\|^2_2$ and

$\gamma^\star_t \leq \frac{1}{2} \log \det(I + \sigma^{-2}_A K)$.

Moreover, for a strategy that maximizes information gain on payoffs, we have

$\sigma^{\max}_T \leq \frac{\sigma_A \sigma}{\sqrt{\sigma^2_A + (T/N - 1)\sigma^2}}$.

Proof.
The inequalities for the posterior variance and the RKHS norm are obtained by using the independent kernel, which represents the worst case. The inequality for the information gain follows by writing

$\gamma^\star_t = \frac{1}{2} \log \frac{\det(I + \sigma^{-2}_A K)}{\det(I + \sigma^{-2}_A \Sigma)} \leq \frac{1}{2} \log \det(I + \sigma^{-2}_A K)$, (31)

where the inequality holds since the denominator is greater than one. Here, we denoted the prior covariance by $K$ and the posterior covariance by $\Sigma$.
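To make the worst-case constants tangible, the sketch below evaluates the bounds of Lemma 2 for an independent prior kernel $K = \sigma^2 I$, the worst case used in the proof (the values of $N$, $\sigma$, $\sigma_A$, and $T$ are illustrative assumptions of ours, not from the analysis):

```python
import numpy as np

N, sigma, sigma_A, T = 4, 1.0, 0.5, 100   # illustrative values only

K = sigma**2 * np.eye(N)                  # independent (worst-case) prior kernel
gamma_bound = 0.5 * np.log(np.linalg.det(np.eye(N) + K / sigma_A**2))
sigma_max_bound = sigma_A * sigma / np.sqrt(sigma_A**2 + (T / N - 1) * sigma**2)
print(f"gamma*_t <= {gamma_bound:.3f}, sigma_max_T <= {sigma_max_bound:.3f}")
```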
Corollary 1. For a strategy that maximizes the payoff information gain and for any timestep $T$, we have

$P\left[ |M_T - M^\star|_\infty \leq \frac{\Delta}{2} \right] \geq 1 - T e^{g(T)}$, where $g(T) = O(-\sqrt{\Delta^2 T})$.

Specifically,

$g(T) = -\sqrt{ \left( \frac{\Delta}{2} \right)^2 \frac{\sigma^2_A + (T/N - 1)\sigma^2}{\sigma^2_A \sigma^2} - \sigma^{-2} \|M^\star\|^2_2 - \frac{1}{2} \log \det(I + \sigma^{-2}_A K) }$.

Lemma 3 (Upper Bound on Entropy). For any discrete random variable $x$ with $n$ outcomes, we have, for each outcome $i$,

$H(x) \leq h_b(p_i) + (1 - p_i) \log(n - 1)$.

Proof.
Without loss of generality, assume $i = 1$. Then

$H(x) = -p_1 \log p_1 - \sum_{j > 1} p_j \log p_j$
$= -p_1 \log p_1 - (n - 1) \sum_{j > 1} \frac{1}{n - 1} p_j \log p_j$
$\stackrel{(a)}{\leq} -p_1 \log p_1 - (n - 1) \left( \sum_{j > 1} \frac{p_j}{n - 1} \right) \log \left( \sum_{j > 1} \frac{p_j}{n - 1} \right)$
$= -p_1 \log p_1 - (1 - p_1) \log \left( \frac{1 - p_1}{n - 1} \right)$
$= h_b(p_1) + (1 - p_1) \log(n - 1)$.

Here, (a) follows from Jensen's inequality applied to the convex function $x \log x$.
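Lemma 3 is easy to sanity-check numerically; the following sketch verifies the bound on random distributions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
for _ in range(1000):
    p = rng.dirichlet(np.ones(n))        # a random distribution over n outcomes
    H = float(-(p * np.log(p)).sum())    # its Shannon entropy
    for p_i in p:
        hb_i = -(p_i * np.log(p_i) + (1 - p_i) * np.log(1 - p_i))
        assert H <= hb_i + (1 - p_i) * np.log(n - 1) + 1e-9  # Lemma 3
```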