ScrofaZero: Mastering Trick-taking Poker Game Gongzhu by Deep Reinforcement Learning
Naichen Shi * 1
Ruichen Li * 2
Sun Youran * 3
Abstract
People have made remarkable progress in game AIs, especially in the domain of perfect information games. However, trick-taking poker games, a popular form of imperfect information game, have long been regarded as a challenge. Since a trick-taking game requires a high level of not only reasoning but also inference to excel, it can be a new milestone for imperfect information game AI. We study Gongzhu, a trick-taking game analogous to, but slightly simpler than, contract bridge. Nonetheless, the strategies of Gongzhu are complex enough for both human and computer players. We train a strong Gongzhu AI, ScrofaZero, from tabula rasa by deep reinforcement learning, while few previous efforts on solving trick-taking poker games utilize the representation power of neural networks. Also, we introduce new techniques for imperfect information games including stratified sampling, importance weighting, integral over equivalent class, and Bayesian inference. Our AI achieves human expert level performance. The methodologies used in building our program can be easily transferred to a wide range of trick-taking games.
1. Introduction
We live in a world full of precariousness. Like a famous quotation from Sherlock Holmes, "I have a turn both for observation and for deduction." (Doyle), one should deduce hidden information from seemingly random observations to make good decisions. Imperfect information games are an abstraction of multi-agent decision making with private information. Related theory has been found useful in auctions, mechanism design, etc. (Tadelis, 2013). The study of specific examples of imperfect information games can strengthen our abilities to navigate through the uncertainties of the world.

*Equal contribution. 1 IOE, University of Michigan. 2 EECS, Peking University. 3 Yau Mathematical Sciences Center, Tsinghua University. Correspondence to: Sun Youran <[email protected]>.

People have successfully built superhuman AIs for perfect information games including Go (Silver et al., 2017) and chess (Silver et al., 2018) by using deep reinforcement learning. Also, by combining deep learning with imperfect sampling, researchers have made huge progress in Mahjong (Li et al., 2020), StarCraft (Vinyals et al., 2019), Dota (OpenAI et al., 2019), and Texas hold'em (Brown et al., 2019).

We study Gongzhu, a 4-player imperfect information poker game. Gongzhu is tightly connected with a wide range of trick-taking games. The detailed rules are introduced in section 2. Building a strong Gongzhu program can deepen our understanding of imperfect information games.

We study Gongzhu for three reasons. Firstly, Gongzhu contains a medium level of randomness and requires careful calculation to reign supreme. Compared with Mahjong and Texas hold'em, Gongzhu is more complicated since its decision space is larger, let alone the toy poker games pervasively studied in the literature such as Leduc (Southey et al., 2005) and Kuhn (Kuhn, 1950). What is more, it is important to read the signals from the history of other players' actions and continuously update beliefs about their private information. The entanglement of sequential decision making and imperfect information makes it extremely nontrivial to find a good strategy out of a high degree of noise.

Secondly, the scoring system of Gongzhu is relatively simple compared with bridge, since players do not bid in Gongzhu. Thus the reward function is easier to design, and we can focus on training a highly skilled playing AI given such a reward function.

Thirdly, compared with large scale games like StarCraft or Dota, Gongzhu is more computationally manageable: all experiments in this paper can be done on only 2 Nvidia 2080Ti GPUs.

We train a Gongzhu program from tabula rasa by self-play without any prior human knowledge other than the game rules. Our algorithm is a combination of Monte Carlo tree search and Bayesian inference, and it extends MCTS to imperfect information games. Our program defeats expert level human Gongzhu players on our online platform Gongzhu Online.

We summarize our contributions below:

• We introduce the game of Gongzhu, which is more difficult than Leduc but more manageable than StarCraft. Gongzhu can be a benchmark for different multi-agent reinforcement learning algorithms.

• We train a strong Gongzhu agent, ScrofaZero, purely by self-play. The training of ScrofaZero requires neither human expert data nor human guidance beyond the game rules.

• To the best of our knowledge, we are the first to combine Bayesian inferred importance sampling with deep neural networks for solving trick-taking games. Our methods can be transferred to other trick-taking games including contract bridge.

The paper is organized as follows. In section 2, we review the rules of Gongzhu and its connection to other trick-taking games. In section 3, we discuss related work in the literature. In section 4, we present an overview of our framework. In section 5, we present our key methodologies, including stratified sampling and integral over equivalent class. In section 6, we analyze the results of extensive empirical experiments.
2. Rules of Gongzhu and Related Games
Before delving further into our program, we use this section to familiarize readers with the rules of Gongzhu. We introduce the rules from general trick-taking poker games down to the specific game of Gongzhu.

Gongzhu belongs to the class of trick-taking games, a large set of games that includes bridge, Hearts, Gongzhu, and Shengji. For readers unfamiliar with any of these games, see supplementary material 8.1 for the common rules of trick-taking games. From now on, we assume readers understand the concept of a trick.

The class trick-taking is divided into two major families according to the goal of the games: the family plain-trick and the family point-trick. In the family plain-trick, which includes Whist, Contract bridge, and Spades, the goal is to win specific tricks or as many tricks as possible. In the family point-trick, which includes Black Lady, Hearts, Gongzhu, and Shengji, the goal is to maximize the total points of the cards obtained.

For most games in the family point-trick, only some cards are associated with points. Depending on the point counting system, the family point-trick is further subdivided into two genera, evasion and attack. In the genus evasion, most points are negative, so the strategy is usually to avoid winning tricks, while the genus attack is the opposite. Gongzhu belongs to the genus evasion.

Figure 1. The classification of trick-taking games. Gongzhu belongs to the genus evasion, the family point-trick. As shown in this figure, Gongzhu is tightly connected with a wide variety of games.

The points in Gongzhu are counted by the following rules.

1. Every heart is associated with points. Their points are shown in Table 1. Notice that
   • all points of hearts are non-positive;
   • the higher the rank of a heart card, the greater the absolute value of its points;
   • the total points of hearts are -200.
2. SQ has -100 points. SQ is called zhu (scrofa) in this game. As zhu has the most negative points, players will try their best to avoid getting it.
3. DJ has +100 points. DJ is called yang (sheep/goat), in contrast with zhu.
4. C10 doubles the points. However, if one gets only C10, it is counted as +50. C10 is called the transformer for obvious reasons.
5. If one player collects all 13 hearts, to reward her braveness and skill, the points of all the hearts are counted as +200 rather than -200. This is called all hearts. It is worth clarifying that
   • to get all hearts, one needs to collect all 13 hearts, including the zero-point hearts H2, H3, and H4;
   • the points are counted separately for each player and then summed within each team.

All the rules except all hearts are summarized in Table 1. The classification of trick-taking games and where Gongzhu belongs are shown in Figure 1.

HA          -50        SQ    -100
HK          -40        DJ    +100
HQ          -30        C10   +50 or double the points
HJ          -20
H5 to H10   -10
H2 to H4      0
Table 1.
Points of cards in Gongzhu. Gongzhu is a trick-taking class, point-trick family, evasion genus game. Note that in addition to the points shown in the above table, there is an extra all hearts rule explained in Section 2.
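To make the scoring rules above concrete, the following is a minimal Python sketch of how a single player's score could be computed from the point cards she has collected. The card encoding (strings like 'H5' or 'SQ') and the function name are our own illustrative choices, not part of the original program.

```python
# Hypothetical card encoding: 'H2'..'HA' for hearts, plus 'SQ', 'DJ', 'C10'.
HEART_POINTS = {'HA': -50, 'HK': -40, 'HQ': -30, 'HJ': -20,
                'H10': -10, 'H9': -10, 'H8': -10, 'H7': -10,
                'H6': -10, 'H5': -10, 'H4': 0, 'H3': 0, 'H2': 0}

def gongzhu_score(collected):
    """Score of one player's collected point cards (a set of card strings)."""
    hearts = [c for c in collected if c in HEART_POINTS]
    if len(hearts) == 13:                  # the "all hearts" rule
        score = +200
    else:
        score = sum(HEART_POINTS[c] for c in hearts)
    if 'SQ' in collected:                  # zhu (scrofa)
        score += -100
    if 'DJ' in collected:                  # yang (sheep/goat)
        score += +100
    if 'C10' in collected:                 # transformer
        if len(collected) == 1:            # only C10: counted as +50
            score = 50
        else:
            score *= 2                     # otherwise it doubles the points
    return score
```

For example, gongzhu_score({'SQ', 'HA', 'C10'}) evaluates to (-100 - 50) * 2 = -300.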
3. Literature Review
Different from Go, Gongzhu is a dynamic incomplete information game (Tadelis, 2013). Research on the dynamics of imperfect information games and on equilibrium concepts has a long history; for example, (Selten, 1975) introduced a more refined equilibrium concept called trembling hand equilibrium. Our work applies such theoretical analysis, including Bayesian inference and sequential decision making, to the trick-taking game Gongzhu. Recently, people proposed counterfactual regret minimization related algorithms (Zinkevich et al., 2007; Brown & Sandholm, 2019) that can be proved to find an ε-Nash equilibrium. The application of counterfactual regret minimization type algorithms has been found successful in Texas hold'em (Brown et al., 2019). Such algorithms are not directly applicable to Gongzhu, which has a larger decision space and a longer decision period.
As shown in section 2, Gongzhu is tightly connected with bridge. However, unlike chess and Go, computer bridge programs cannot beat human experts yet. The recent winners of the World Computer-Bridge Championship (Wikipedia, 2021) are Wbridge5 (in 2018) and Micro Bridge (in 2019; the 2020 championship was cancelled, and the next one will be held in 2021). Micro Bridge first randomly generates unknown hands under known conditions derived from the history, then applies tree search and pruning algorithms to make decisions under perfect information. Wbridge5 does not reveal its algorithm to the public, but it is believed to be similar to Micro Bridge, i.e. human-crafted rules for bidding and heuristic tree search algorithms for playing. Different from these works, we use a deep neural network to evaluate the current situation and to generate actions. Recently, (Rong et al., 2019) built a bidding program first by supervised learning and then by reinforcement learning. In contrast, we train our Gongzhu program from tabula rasa.

The use of Monte Carlo tree search both as a good rollout algorithm and as a stable policy improvement operator is popular in perfect information games like Go (Silver et al., 2017) and chess (Silver et al., 2018). (Grill et al., 2020) analyzed some theoretical properties of the MCTS used in AlphaGo Zero. (Whitehouse, 2004; Browne et al., 2012) discuss popular MCTS algorithms, including information set Monte Carlo tree search (ISMCTS), an algorithm that combines MCTS with incomplete information. The Monte Carlo tree search algorithm used in our program is the standard upper confidence bound version, and it is computationally simpler than the full ISMCTS.
The dynamics of training multi-player game AIs by self-play can be complicated. People have found counter-examples where almost all gradient-based algorithms fail to converge to a Nash equilibrium (Letcher, 2021). Also, some games are nontransitive for certain strategies (Czarnecki et al., 2020). Instead of a Nash equilibrium strategy, we attempt to find a strong strategy that can beat human expert level players. We also define a metric for the nontransitivity in the game. Different from (Letcher et al., 2019), where nontransitivity is defined only in the parameter space, our metric can be defined for any strategy.
4. Framework
This section is an overview of our program. We start by introducing some notation. As discussed above, Gongzhu is a sequential game with incomplete information. The players are denoted as $N = \{0, 1, 2, 3\}$. We denote the integer $u \in [0, 52]$ as the stage of the game, which can also be understood as how many cards have been played. We define a history $h_u \in \mathcal{H}$ to be a sequence of tuples $\{(i_t, a_t)\}_{t=1,\dots,u}$, where $t$ is the stage, $i_t$ is the player who takes an action at stage $t$, and $a_t$ is the card she played. A history represents the card playing up to time $u$. We sometimes write $h_u$ as $h$ for simplicity, and use $h(t)$ to denote the $t$-th tuple in $h$. The history $h$ is publicly observable to all players. It is natural to assume that each player has perfect recall, i.e. they can remember the history exactly. After the dealing of each round, players have their initial hand cards $c = \{c_i\}_{i=0,1,2,3}$. $c_i$ is private information of player $i$, and the other players cannot see it. Also, we denote by $c_i^u$ the $i$-th player's remaining hand cards at stage $u$. An action $a \in \mathcal{A}$ is the card to play, and $|\mathcal{A}| = 52$. By the rules of Gongzhu, a player may not be able to play all cards depending on the suit of that trick, so the actual set of legal choices of $a$ may be smaller than the set of remaining cards in one's hand.

An information set is a concept from incomplete information game theory that, roughly speaking, characterizes the information available for decision making. For the standard Gongzhu game, we define the information set of player $i$ at time $u$ to be $(h_u, c_i)$, i.e. the public information $h_u$ combined with player $i$'s private information $c_i$. The payoff $r$ is calculated only at the end of the game, i.e. when all 52 cards are played: $r(h_T) = (r_0(h_T), r_1(h_T), r_2(h_T), r_3(h_T))$ (where $T = 52$) represents the scores of the four players. A strategy of player $i$ is a map from history and private information to a probability distribution over the action space, $\pi_i: \mathcal{H} \times \mathcal{C}_i \to \Delta(\mathcal{A})$, i.e. given a specific history $h$ and initial cards $c_i$, $\pi_i(c_i, h)$ chooses a card to play. A strategy profile $\pi$ is the tuple of the 4 players' strategies, $\pi = (\pi_0, \pi_1, \pi_2, \pi_3)$. We use $\pi_{-i}$ to denote the strategies of the players other than $i$.

The value of an information set from the perspective of player $i$ is

$$v_i(\pi_i, \pi_{-i}, c_i, h) = \mathbb{E}_{p(c_{-i} \mid h, \pi)}\bigl[\mathbb{E}_{\pi}\bigl[r_i(h_T)\bigr]\bigr] = \mathbb{E}_{p(c_{-i} \mid h, \pi)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \qquad (1)$$

We write $v_{\pi}(h, c_i, c_{-i})$ in place of $\mathbb{E}_{\pi}[r_i(h_T)]$ for simplicity. $v_{\pi}(h, c_i, c_{-i})$ can be interpreted as the value of a state under perfect information, where $h_T$ ranges over the possible terminal states starting from $h$ with initial hands $c$. The inner expectation is taken in the following sense. Suppose there exists an oracle that knows exactly each player's initial cards $c$ and each player's strategy $\pi$; it plays on each player's behalf with her strategy starting from $h$ until the end of the game. Due to the randomness of mixed strategies, the outcome, and thus the payoff of the game, is also random. The inner expectation is taken over the randomness of the game trajectory resulting from the mixed strategies.

The outer expectation is taken over player $i$'s belief. We define a scenario to be one possible initial hand configuration $c_{-i}$ from the perspective of player $i$. In Gongzhu, a player's belief is the probability $p(c_{-i} \mid h, \pi)$ she assigns to each possible scenario.
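As a concrete illustration of this notation, the sketch below shows one possible Python representation of a history and an information set; the class and field names are ours and are not taken from the ScrofaZero code base.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Set

Card = str          # e.g. 'SQ', 'H5'; 52 cards in total
Player = int        # players are 0, 1, 2, 3

@dataclass
class History:
    """Public history h_u: the sequence of (player, card) tuples played so far."""
    plays: List[Tuple[Player, Card]] = field(default_factory=list)

    @property
    def stage(self) -> int:
        """u = number of cards already played."""
        return len(self.plays)

@dataclass
class InformationSet:
    """What player i actually observes at stage u: (h_u, c_i)."""
    player: Player
    history: History            # h_u, observable by everyone
    hand: Set[Card]             # c_i^u, private remaining cards of player i
```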
At the beginning of the game, all possible scenarios have equal probability. This is the case when $h = h_0 = \emptyset$, which is usually referred to as a common prior by game theorists. As the game proceeds, every player can see what has been played by the other players, and this changes her belief about the probability of different scenarios. For example, initial hand configurations that are not compatible with the rules of Gongzhu clearly have probability 0. A natural way of updating beliefs is Bayes' rule. We cover more details on calculating the outer expectation in sections 5.3 and 5.4.

A player's goal is to optimize her strategy at every decision node, assuming that she knows the other players' strategies $\pi_{-i}$:

$$\max_{\pi_i} v_i(\pi_i, \pi_{-i}, c_i, h) \qquad (2)$$

She chooses an action at every decision node to maximize her expected payoff. Gongzhu ends after all 52 cards are played. The extensive form of the game can be regarded as a $4 \times 13$-layer decision tree. Thus, in principle, a perfect Bayesian equilibrium (Tadelis, 2013) can be solved by backward induction. However, we do not solve for an exact perfect Bayesian equilibrium here because (i) compared with obtaining an equilibrium policy that is unexploitable by others, it is more useful to obtain a policy that can beat most other high level players, and (ii) an exact perfect Bayesian equilibrium is computationally infeasible.

To obtain a strong policy, we train a neural network by self-play. Our neural network has a fully connected structure (see supplementary material 8.3). We divide training and testing into two separate processes. In training, we train this neural network to excel under perfect information, and in testing, we try to replicate the performance of the model under perfect information by adding the Bayesian inference described in sections 5.3 and 5.4.

To train the neural network, we assume each player knows not only her own initial hand, but also the initial hands of the other players (see supplementary material 8.7). In the terminology of computer bridge, this is called double dummy. Then the outer expectation in equation (1) can be removed, since each player knows exactly what the other players have in their hands at any time of the game. The use of perfect information in training has two benefits: firstly, the randomness of the hidden information is eliminated, so the training of the neural network becomes more stable; secondly, since sampling hidden information is time consuming, using perfect information saves time. Although this treatment may downplay the use of strategies like bluffing in actual play, the trained network performs well. Inspired by AlphaGo Zero (Silver et al., 2017), we use Monte Carlo tree search as a policy improvement operator to train the neural network (see supplementary material 8.6). The loss function is defined as

$$\ell = D_{\mathrm{KL}}(p_{\mathrm{MCTS}} \,\|\, p_{\mathrm{nn}}) + \lambda \, |v_{\mathrm{MCTS}} - v_{\mathrm{nn}}| \qquad (3)$$

where $p_{\mathrm{MCTS}}$ and $v_{\mathrm{MCTS}}$ are the policy and value of a node returned by Monte Carlo tree search from that node, and $p_{\mathrm{nn}}$ and $v_{\mathrm{nn}}$ are the outputs of our neural network. The parameter $\lambda$ weights the value loss against the policy loss. Since the value of a specific node can sometimes be as large as several hundred while the KL divergence is at most 2 or 3, this parameter is necessary. For more details on training, see supplementary material 8.8.
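The loss in equation (3) can be written down directly. Below is a minimal PyTorch sketch, assuming p_nn is the network's probability vector over the 52 actions and v_nn its scalar value head; the variable names and the default value of lam are our own illustrative choices.

```python
import torch

def scrofa_loss(p_nn, v_nn, p_mcts, v_mcts, lam=0.01):
    """Loss of equation (3): KL(p_MCTS || p_nn) + lambda * |v_MCTS - v_nn|.

    p_nn, p_mcts: probability vectors over the 52 actions (each sums to 1).
    v_nn, v_mcts: scalar value estimates of the current node.
    lam: the weighting parameter lambda (the value here is illustrative).
    """
    eps = 1e-12  # avoid log(0)
    kl = torch.sum(p_mcts * (torch.log(p_mcts + eps) - torch.log(p_nn + eps)))
    return kl + lam * torch.abs(v_mcts - v_nn)
```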
For actual testing or benchmarking, however, we must comply with the rules and play honestly. On the other hand, the use of Monte Carlo tree search requires knowledge of each player's hand in order for each player to perform tree search. To bridge the gap, we use stratified and importance sampling to estimate the outer expectation of equation (1). We sample $N$ scenarios by our stratified sampling, then use MCTS with our policy network as the default value estimator to calculate the Q value of each choice. We then average these Q values with an importance weight over the hidden information. Finally, we choose the card with the highest averaged Q value to play. Details of stratified and importance sampling are discussed in sections 5.3 and 5.4. In figure 2, we can see that the neural network improves itself steadily.

Figure 2.
AI trained with perfect information and tested in the standard Gongzhu game. Testing scores are calculated by the WPG described in section 5.2. Raw network means that we use MCTS for one step, with the value network as the default evaluator. Mr. Random, Mr. If and Mr. Greed are three human experience AIs described in section 5.1. Every epoch takes less than one minute on a single Nvidia 2080Ti GPU.
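The test-time decision procedure described above can be summarized in a few lines of Python. The helper functions (sample_scenarios, scenario_weight, mcts_q_values) stand in for the stratified sampler, the IEC weighting of section 5.4, and the MCTS rollout respectively; they are placeholders, not functions from the released code.

```python
import numpy as np

def choose_card(info_set, legal_cards, n_scenarios, sample_scenarios,
                scenario_weight, mcts_q_values):
    """Pick the card with the highest importance-weighted average Q value."""
    scenarios = sample_scenarios(info_set, n_scenarios)    # stratified sampling
    weights = np.array([scenario_weight(info_set, s) for s in scenarios])
    weights = weights / weights.sum()                       # normalize importance weights
    avg_q = {card: 0.0 for card in legal_cards}
    for w, scenario in zip(weights, scenarios):
        q = mcts_q_values(info_set, scenario, legal_cards)  # dict: card -> Q under this deal
        for card in legal_cards:
            avg_q[card] += w * q[card]
    return max(avg_q, key=avg_q.get)
```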
5. Methodology
To expand the strategy space, we build a group of human experience based AIs using standard methods. We name them Mr. Random, Mr. If, and Mr. Greed. Among them, Mr. Greed is the strongest. The performance of these AIs is shown in table 2. More details can be found in supplementary material 8.2.
We evaluate the performance of our AIs by letting two copies of an AI team up as partners and play against two Mr. Greeds, and calculating their average winning points. We call this the Winning Point over Mr. Greed (WPG). In a typical evaluation, we run a large number of games. We show that this evaluation system is well-defined in supplementary material 8.10.

        R      I      G     SZS
R       0     194    275    319
I     -194     0      80    127
G     -275   -80      0      60
SZS   -319  -127    -60       0

Table 2.
Combat results for 4 different AIs. In this table, R stands for Mr. Random, I for Mr. If, G for Mr. Greed, and SZS for ScrofaZeroSimple. R, I, and G are human experience AIs described in Section 5.1 and supplementary material 8.2. SZS is ScrofaZero without the IEC algorithm described in section 5.4.
Given a history $h$ and the initial cards $c_i$ in player $i$'s hand, one should estimate what cards the other players have in their hands. As discussed before, we denote this by $c_{-i}$ and use $c_{-i}$ and scenario interchangeably. We use $\mathcal{C}(c_i)$ to denote the set of all possible $c_{-i}$'s.

The most natural way to calculate one's belief about the distribution of scenarios in $\mathcal{C}(c_i)$ is Bayesian inference (Tadelis, 2013). From the Bayesian viewpoint, if the players' strategy profile is $\pi_{i=0,1,2,3}$, which we now assume to be common knowledge, player $i$'s belief after observing history $h$ that the initial cards in the other players' hands are $c_{-i}$ is

$$p(c_{-i} \mid h) = \frac{p(h \mid c_{-i}; \pi)\, p(c_{-i})}{\sum_{e \in \mathcal{C}(c_i)} p(h \mid e; \pi)\, p(e)} \qquad (4)$$

where $p(h \mid c_{-i}; \pi)$ is the probability that history $h$ is generated if the initial card configuration is $c_{-i}$ and the players play according to the strategy profile $\pi_{i=0,1,2,3}$. We omit the dependence on $\pi$ when there is no confusion.

The belief is important because players use it to calculate the outer expectation in equation (1),

$$\mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \qquad (5)$$

where $v_{\pi}(h, c_i, c_{-i})$ is the value function. The exact calculation of (5) through (4) requires enumerating all possible configurations in the set $\mathcal{C}(c_i)$, which can contain an astronomically large number of elements. Such a size is computationally intractable, so we seek to estimate it by Monte Carlo sampling. A naive application of Monte Carlo sampling can bring large variance to the estimation, so we derive a stratified importance sampling approach to obtain high quality samples.

In trick-taking games, there are always cases where some key cards are much more important than the others. These cards are usually associated with high variance in Q value; see section 6.2. This is especially true for Gongzhu. We use stratified sampling to exhaust all possible configurations of the important cards.

More specifically, we first divide the entire $\mathcal{C}(c_i)$ into several mutually exclusive strata $\{S_1, S_2, \dots, S_t\}$ such that $\mathcal{C}(c_i) = \bigcup_{j=1}^{t} S_j$. Each stratum represents one configuration of the important cards. To generate the partition $\{S_1, S_2, \dots, S_t\}$, we identify the key cards $\{c_{k_1}, c_{k_2}, \dots, c_{k_q}\}$ in $c_{-i}$ based on the statistics of the trained neural network (see section 6.2), then exhaust all possible configurations of $\{c_{k_1}, c_{k_2}, \dots, c_{k_q}\}$. After obtaining the partition, we sample inside each stratum. More formally, by the law of total expectation, we can rewrite equation (5) as

$$\sum_{j=1}^{t} p(S_j)\, \mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h, S_j)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \qquad (6)$$

where $p(S_j)$ is the probability that $c_{-i}$ is in stratum $S_j$, i.e. $p(S_j) = \mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[\mathbb{1}\{c_{-i} \in S_j\}\bigr]$, and $p(c_{-i} \mid h, S_j)$ is the probability distribution of $c_{-i}$ given the history $h$ and the fact that $c_{-i}$ is in stratum $S_j$. As a zeroth order approximation, we set $p(S_j) = 1/t$ for all $j$.

Since the expectation in equation (6) is still analytically intractable, we employ importance sampling to bypass the problem. If we can obtain samples from a simpler distribution $q(c_{-i})$ that has common support with $p(c_{-i} \mid h)$, then by the Radon-Nikodym theorem

$$\mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] = \mathbb{E}_{c_{-i} \sim q(c_{-i})}\Bigl[v_{\pi}(h, c_i, c_{-i})\, \frac{p(c_{-i} \mid h)}{q(c_{-i})}\Bigr] \qquad (7)$$

where we call the term $\frac{p(c_{-i} \mid h)}{q(c_{-i})}$ the posterior distribution correction.
If we draw $N$ samples $C_N = \{c_{-i}^{(1)}, c_{-i}^{(2)}, \dots, c_{-i}^{(N)}\}$ from $q(c_{-i})$:

$$\mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \approx \frac{1}{N} \sum_{k=1}^{N} v_{\pi}(h, c_i, c_{-i}^{(k)})\, \frac{p(c_{-i}^{(k)} \mid h)}{q(c_{-i}^{(k)})} \qquad (8)$$

We take $q(c_{-i})$ to be the following distribution:

$$q(c_{-i}) = \begin{cases} 1 / |\mathcal{C}(c_i)| & \text{if } c_{-i} \text{ is compatible with the history} \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

i.e. $q(c_{-i})$ is a uniform distribution over all $c_{-i}$ that are compatible with the history. Compatible with the history means that under such a configuration the actions in history $h$ do not violate any rules.

Since the ratio $\frac{p(c_{-i} \mid h)}{q(c_{-i})}$ is still intractable, we use

$$\hat{p}(c_{-i}^{(k)} \mid h) = \frac{p(h \mid c_{-i}^{(k)})\, p(c_{-i}^{(k)})}{\sum_{l=1}^{N} p(h \mid c_{-i}^{(l)})\, p(c_{-i}^{(l)})}, \qquad \hat{q}(c_{-i}) = \frac{1}{N} \qquad (10)$$

to approximate $p(c_{-i}^{(k)} \mid h)$ and $q(c_{-i}^{(k)})$. Equation (10) changes the scope of the summation in the denominator of (4) from the entire population to only the samples. Then equation (7) reduces to

$$\mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \approx \frac{1}{\sum_{k=1}^{N} s(c_{-i}^{(k)})} \sum_{l=1}^{N} v_{\pi}(h, c_i, c_{-i}^{(l)})\, s(c_{-i}^{(l)}) \qquad (11)$$

where $s(c_{-i}^{(k)})$ is the score we assign to scenario $c_{-i}^{(k)}$, defined as $s(c_{-i}^{(k)}) = p(h \mid c_{-i}^{(k)})\, p(c_{-i}^{(k)})$. We introduce an algorithm to calculate this score in section 5.4.

In this section, we focus on how to compute $s(c_{-i})$. We assume that the other players use strategies similar to player $i$'s. Then the policy network of ScrofaZero can be used to estimate $p(h \mid c_{-i})$. To continue, we define the correction factor $\gamma$ of a single action as

$$\gamma(a, h, c_j) = e^{-\beta \cdot \mathrm{regret}} = e^{-\beta (q_{\max} - q_a)}, \qquad (12)$$

which is the unnormalized probability of player $j$ taking action $a$ under the assumption that $j$ uses a strategy similar to player $i$'s. In definition (12), $h$ is the history before action $a$, $c_j$ the hand cards of player $j$, $q_a$ the policy network output for player $j$'s action $a$, $q_{\max}$ the greatest output among the legal choices in $c_j$, and $\beta$ a temperature controlling the level of certainty of our belief. Then the $p(h \mid c_{-i})$ in formula (4) can be written as

$$p(h \mid c_{-i}) = p(h_u \mid c_{-i}) = \prod_{t=0}^{u-1} p(a_{t+1} \mid h_t, c_{j(t+1)}) = \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{\sum_{\alpha \in \mathrm{lc}(t)} \gamma(\alpha, h_t, c_{j(t+1)})}, \qquad (13)$$

where $\mathrm{lc}(t)$ is the set of legal choices at stage $t$. As a generalized Bayesian treatment, we estimate $p(h \mid c_{-i})$ with products of correction factors. We call this algorithm Integral over Equivalent Class (IEC). The pseudocode for IEC is given in algorithm 1. As an attention mechanism and to save computational resources, some "important" history slices are selected, based on the statistics in section 6, when calculating a scenario's score in algorithm 1; see supplementary material 8.12 for details.

Compared with naive Bayes weighting, our IEC weighting is insensitive to variations in the number of legal choices and thus more stable. Experiments show that IEC can outperform naive Bayes weighting by a large margin, see table 3.

In the rest of this section we explain the intuition behind integral over equivalent class. We begin by introducing the concept of irrelevant cards. Irrelevant cards are cards which (i) will not change their own correction factor or other cards' correction factors if they are moved to another player's hand, and (ii) will not change their correction factor if other cards are moved.
Algorithm 1
Integral over Equivalent Class (IEC)
Input: history $h_u$, player $i$'s initial cards $c_i$, one possible scenario $c_{-i}$.
$s(c_{-i}) \leftarrow 1$
for $t = u-1, u-2, \dots, 0$ do
    $h \leftarrow h_t$, $c \leftarrow c_{j(t+1)}$, $a \leftarrow a_{t+1}$
    if $a$ is important then
        $s(c_{-i}) \leftarrow \gamma(a, h, c)\, s(c_{-i})$
    end if
end for
Output: score $s(c_{-i})$ for scenario $c_{-i}$.

Techniques                   Performance   Win(+Draw) Rate
US                               ±
SS                               ±
US with IEC                      ±               -
SS with IEC                      ±
SS with IEC (against US)         ±

Table 3.
Performance with different methods. US stands for Uniform Sampling, SS for Stratified Sampling, and IEC for Integral over Equivalent Class. The sampling number of US is set to 9 so that it equals the sampling number of SS. The last line of this table is ScrofaZero with the strongest sampling technique, SS with IEC, playing against itself without any of these methods.

The existence of approximately irrelevant cards can be confirmed both from the experience of playing games and from the statistics in section 6. In figure 4 of section 6.1, we see that there are some cards whose variance of values is small. These cards are candidates for approximately irrelevant cards. See supplementary material 8.11 for a concrete example.

We call two distributions of cards that differ only in irrelevant cards equivalent. This equivalence relation divides all scenarios $\mathcal{C}(c_i)$ into equivalence classes. We denote the equivalence class of scenario $c_{-i}$ by $[c_{-i}]$. We should integrate over the whole equivalence class once we obtain the result of one representative element, because the MCTS procedure for each scenario is expensive. The weight of one equivalence class should be

$$p(h_u \mid c_{-i})\, p([c_{-i}]) = \sum_{\substack{\text{all permutations of} \\ \text{irrelevant cards}}} \; \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{Y_{t+1} + J_{j(t+1)}} \qquad (14)$$

where $j(t)$ is the player who played at stage $t$, $J_{j(t)}$ the sum of the correction factors of the irrelevant cards held by player $j(t)$, and $Y_t$ the sum of the correction factors of the other cards held by that player. $Y$ may change in different scenarios, but $\sum_{j=1}^{3} J_j$ remains unchanged by definition; we denote it by $J$.

Following the steps in supplementary material 8.14, we can evaluate the summation in equation (14):

$$p(h_u \mid c_{-i})\, p([c_{-i}]) = 3^N \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{Y_{t+1} + J/3}\; O(\xi) \qquad (15)$$

where $N$ is the number of irrelevant cards, $J$ the sum of the correction factors of all irrelevant cards, and $\xi$ a bounded real number; see supplementary material 8.14 for details.

Notice that the denominators in the result of (15) are insensitive to changes in $Y$, because both $Y$ and $J$ are always greater than 1 (see section 6.1 for the magnitudes of $Y$ and $J$). For scenarios in different equivalence classes, the $Y$'s might differ, but $J$ is always the same. So the integral remains approximately the same. Thus we can ignore the denominators when calculating a scenario's score, or in other words, we can use the product of unnormalized correction factors as the scenario's score. This is exactly the procedure of IEC.
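A compact Python sketch of algorithm 1, and of how the resulting scores enter the importance-weighted estimate of equation (11), is given below. The correction-factor function gamma, the importance test, and the scenario accessor are placeholders to be backed by the trained policy network; the names are ours.

```python
import numpy as np

def iec_score(history, scenario, gamma, is_important):
    """Algorithm 1: product of correction factors over the important history slices."""
    score = 1.0
    for t in reversed(range(len(history))):        # t = u-1, ..., 0
        player, action = history[t]                # the (player, card) tuple at stage t+1
        hand = scenario.hand_at(player, t)         # that player's cards before the action (placeholder)
        if is_important(action, history[:t]):
            score *= gamma(action, history[:t], hand)
    return score

def weighted_value(history, scenarios, values, gamma, is_important):
    """Self-normalized importance-weighted estimate of equation (11)."""
    scores = np.array([iec_score(history, s, gamma, is_important) for s in scenarios])
    return float(np.dot(scores, values) / scores.sum())
```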
6. Empirical Analysis
To begin with, we present some basic statistics of the neural network, including the mean and variance of the value and of the correction factor $\gamma$ of playing a card. The average correction factor $\gamma$ shown in figure 3 reflects whether a card is "good" or not: the higher the correction factor, the better it is to take that action. We can see in the figure that SQ is not a "good" card; one had better not play it. Another finding is that the correction factors of all four suits peak at around 10, so when a player chooses among cards lower than 10, she should play the largest one.

However, in Gongzhu the value of a card depends heavily on the situation. Hence it is important to study the variance of the correction factor. For example, SQ brings large negative points if it ends up in your team's tricks, but large profits if it ends up in your opponents'. The variance of values shown in figure 4 illustrates the magnitude of risk when dealing with the corresponding cards. We can see that SQ's variance is large, which is in line with our analysis. Meanwhile, heart cards should be handled differently in different situations. Within the heart suit, HK and HA are especially important. This may be the result of the finesse technique and the all hearts rule in Gongzhu.

These statistics from ScrofaZero reveal which cards are important. This information is used in the stratified sampling of section 5.3. Also, they are consistent with human experience.
Average value of the correction factor for different cards. The temperature $\beta$ used here is fixed to a constant value.
Variance of values for different cards.
The best classical AI, Mr. Greed, explained in Section 5.1, has many fine-tuned parameters, including a value for each card. For example, although SA and SK are not directly associated with points in the rules of Gongzhu, they have a great chance of capturing the most negative card SQ, which weighs -100. So SA and SK are counted as negative points in Mr. Greed. These parameters are fine-tuned by human experience. In the area of chess, people have compared the difference in chess piece relative values between human convention and a deep neural network AI trained from zero (Tomašev et al., 2020). Here we conduct a similar analysis of our neural network AI and Mr. Greed's parameters. Table 4 shows the experience parameters in Mr. Greed for some important cards and ScrofaZero's output under typical situations. Negative means that the card is a risk or hazard, so it is better to get rid of it, while a positive value has the opposite meaning. We can see that Mr. Greed and ScrofaZero agree with each other very well.

Cards   Mr. Greed   ScrofaZero
SA         −         − ∼ −
SK         −         − ∼ −
CA         −           −
CK         −           −
CQ         −           −
CJ         −           −
DA         30          20
DK         20          10
DQ         10          10
Table 4.
The experience parameters in Mr. Greed and the output of ScrofaZero. Negative means that the card is a burden, positive the opposite. ScrofaZero's values are estimated under typical situations.
7. Conclusion
In this work we introduce the trick-taking game Gongzhu as a new benchmark for incomplete information games. We train ScrofaZero, a human expert level AI capable of distilling information and updating beliefs from historical observations. The training starts from tabula rasa and does not require domain-specific human knowledge. We introduce stratified sampling and IEC to boost the performance of ScrofaZero.

Future research directions include designing better sampling techniques, incorporating sampling into the neural network, and applying our methods to other trick-taking games like contract bridge. Also, we believe the knowledge gained in training ScrofaZero can be transferred to other real-world applications where imperfect information plays a key role in decision making.
References
Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In The Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 2019.

Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4, 2012.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 2012.

Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., and Jaderberg, M. Real world games look like spinning tops. In NeurIPS, 2020.

Doyle, A. C. The Sign of Four.

Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-Carlo tree search as regularized policy optimization. In International Conference on Machine Learning, 2020.

Kuhn, H. W. Simplified two-person poker. Contributions to the Theory of Games, 1950.

Letcher, A. On the impossibility of global convergence in multi-loss optimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=NQbnPjPYaG6.

Letcher, A., Balduzzi, D., Racanière, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. Differentiable game mechanics. Journal of Machine Learning Research, 2019.

Li, J., Koyamada, S., Ye, Q., Liu, G., Wang, C., Yang, R., Zhao, L., Qin, T., Liu, T.-Y., and Hon, H.-W. Suphx: Mastering Mahjong with deep reinforcement learning. arXiv, 2020. URL https://arxiv.org/abs/2003.13590.

OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. Dota 2 with large scale deep reinforcement learning. 2019. URL https://arxiv.org/abs/1912.06680.

Rong, J., Qin, T., and An, B. Competitive bridge bidding with deep neural networks. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019.

Selten, R. Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory, 1975.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 2018.

Southey, F., Bowling, M. P., Larson, B., Piccione, C., Burch, N., Billings, D., and Rayner, C. Bayes' bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, 2005.

Tadelis, S. Game Theory: An Introduction. Princeton University Press, Princeton, New Jersey, 2013.

Tomašev, N., Paquet, U., Hassabis, D., and Kramnik, V. Assessing game balance with AlphaZero: Exploring alternative rule sets in chess, 2020.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019.

Whitehouse, D. Monte Carlo Tree Search for Games with Hidden Information and Uncertainty. PhD thesis, University of York, 2004.

Wikipedia. Computer bridge, 2021. URL https://en.wikipedia.org/wiki/Computer_bridge.

Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, 2007.
8. Experimental Details and Extended Data
Gongzhu belongs to the class trick-taking, which is a large set of games including bridge, Hearts, Gongzhu and Shengji. We dedicate this section to familiarizing readers with trick-taking games. Trick-taking games share the following common rules.

1. A standard 52-card deck is used in most cases.
2. Generally, there are four players paired in partnership, with partners sitting opposite each other around a table.
3. Cards are shuffled and dealt to the four players at the beginning.
4. As the name suggests, a trick-taking game consists of a number of tricks. In a trick, the four players play one card each, sequentially, by the following rules:
   • The player leading the first trick is chosen randomly or by turns. The first card of each trick can be any card in that player's hand.
   • The following players should follow the suit if possible. There are no limits on the ranking of the cards played.
   • At the end of each trick, the four cards played are ranked and the player who played the card of the highest rank becomes the winner.
   • The winner of the last trick leads the next trick.
   • The playing order is usually clockwise.
5. The cards are usually ranked by: A K Q J 10 9 8 7 6 5 4 3 2.
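As an illustration of the trick mechanics above, here is a minimal Python sketch that determines the winner of one trick under the simple rule that only cards of the led suit can win (no trumps, as in Gongzhu); the encoding and the function name are our own.

```python
RANK_ORDER = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']

def trick_winner(plays):
    """plays: list of (player, card) in playing order, card strings like 'H10' or 'SQ'.
    Returns the player who wins the trick: highest rank of the suit that was led."""
    led_suit = plays[0][1][0]                         # first character of a card is its suit
    def rank(card):
        return RANK_ORDER.index(card[1:])             # the rest of the string is the rank
    winner, _ = max(
        (p for p in plays if p[1][0] == led_suit),    # only cards following the led suit can win
        key=lambda p: rank(p[1]),
    )
    return winner
```

For example, trick_winner([(0, 'H10'), (1, 'HA'), (2, 'S2'), (3, 'H3')]) returns 1, since HA is the highest heart and hearts were led.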
We build a group of human experience based AIs using standard methods. The group includes

1. Mr. Random: a random player choosing cards from the legal choices uniformly at random.
2. Mr. If: a program with 33 if statements representing human experience. Mr. If can outperform Mr. Random by a lot with such a small number of ifs.
3. Mr. Greed: an AI with hundreds of if statements and partial backward induction. It contains many handcrafted hyper-parameters to complete the search. Mr. Greed can outperform Mr. If, though not in proportion to their numbers of if statements.

The performances of these AIs are shown in table 2. More details can be found in our repository.
We use fully connected layers with skip connections as our model's basic block. The input of our neural network is encoded into a 434-dimensional onehot vector (this is explained in detail in the next subsection). The output is a 53-dimensional vector, with the first 52 elements being the $p$ vector and the last element being $v$.

We also tried other network topologies, including shallower fully connected layers and ResNet. Their performance is shown in table 5.

Network Topology    Layers    Parameters    WPG
Fully Connection      24      11416299        ±
ResNet                18      11199093      − ±

Table 5.
Performance of different networks. These scores are evaluated by the WPG introduced in section 5.2. Sufficient rounds of games are played to make sure the variance is small enough.
Since ResNet does not show a significant improvement in performance, we stick to the fully connected neural network for most of our experiments.
The input of our neural network is a vector concatenating three groups of features.

• The hands of the 4 players (one 52-card representation per player, since Gongzhu uses a standard 52-card deck). The cards of the other 3 players are guessed by the methods discussed in Section 5.3.
• The cards played in this trick. We choose this format because at most 3 cards are played before the specified player can play. The per-card representation here is slightly longer than 52 due to the diffuse technique described in the next subsection.
• The cards associated with points that have already been played. In the game Gongzhu, there are 16 cards which carry scores.

We use the diffuse technique for representing the cards in this trick when preparing inputs. Normally we would set only the single element of the onehot vector corresponding to a card to 1. However, we want to amplify the input signal. Hence we set not only the element in the onehot vector corresponding to this card, but also the two adjacent elements, to 1. We also extend the length of the representation of each card so that the elements at the two endpoints can be diffused as well. Figure 5 shows how the diffuse technique works. Input in this form can be transformed to and from the standard input by a single fully connected layer. In experiments, the diffuse technique accelerates training.
We use the diffuse technique when representing the cards in this trick to accelerate training. The upper part shows the normal input, while the lower part shows the input after applying the diffuse technique.
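A minimal numpy sketch of the diffuse encoding is shown below; the vector length and index mapping are illustrative assumptions rather than the exact layout used by ScrofaZero.

```python
import numpy as np

def diffuse_encode(card_index, length=54):
    """Encode one card (index 0..51) as a 'diffused' onehot vector.

    Instead of setting only the element at card_index to 1, the two adjacent
    elements are also set to 1; the vector is padded (length > 52) so that
    cards at the two endpoints can be diffused as well.
    """
    v = np.zeros(length)
    center = card_index + 1                # shift by one to leave room at the left endpoint
    v[center - 1:center + 2] = 1.0         # the card and its two neighbors
    return v
```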
We use standard MCTS with the value network to evaluate positions in the tree. We use the UCB (Upper Confidence Bound) method (Bubeck & Cesa-Bianchi, 2012) to select a child node. More specifically, for each selection we choose node $i = \arg\max_j \, v_j + c \sqrt{\ln N / n_j}$, where $v_j$ is the value of node $j$, $n_j$ is how many times node $j$ has been visited, $N$ is the total number of visits, and $c$ is the exploration constant. Two hyperparameters are important in MCTS: the exploration constant $c$ and the search number $T_{\mathrm{MCTS}}$. We keep the exploration constant at $c = 30$. For the search number, we set $T_{\mathrm{MCTS}} = 2 \times \{\text{legal choice number}\}$ when training and $T_{\mathrm{MCTS}} = 10 + 2 \times \{\text{legal choice number}\}$ when evaluating. The search number is crucial in training and is discussed in the next subsection.

The network is trained under perfect information, where it can see the cards in the others' hands. In other words, we do not need to sample hidden information during training. This setting is preferred because

1. without sampling, training is more robust;
2. in this way, more diversified middlegames and endings can be reached, which helps the neural network improve faster by learning from different circumstances.

For example, we find that a neural network trained with perfect information masters the techniques in all hearts much faster than one trained with imperfect information.

After each MCTS search and play, the input and the rewards for the different cards are saved in a buffer for training. The search number in MCTS is crucial. When it is small, the neural network of course will not improve. However, we find that when the search number in MCTS is too large, the neural network will again not improve, or will even degrade! We find $2 \times \{\text{legal choice number}\}$ the most suitable search number for MCTS. Notice that, with $2 \times \{\text{legal choice number}\}$ searches, MCTS can only predict the future approximately two cards ahead. It is quite surprising that the neural network can still improve and finally acquire an "intuition" for the long term future.

We typically let four AIs play 64 games and then train for one batch. What is more, inspired by online learning, we also let the neural network review the data of the last two batches, so the data used for one update comes from 3 × 64 games. The target function (3) is then optimized by the Adam optimizer for 3 iterations.

There is one thing special in our handling of the loss function which deserves some explanation. Normally, it is natural to mask the illegal choices in the probability output of the network (i.e. mask them after the softmax layer). However, we mask the output before the softmax layer. In coding language, we use softmax(p × legal_mask) rather than legal_mask × softmax(p). We find this procedure much better than the other one. A possible explanation is that, if we apply the mask before the softmax, the information about illegal choices is still preserved in the loss, which can help the layers before the output find a more reasonable feature representation.

We provide an arena for different AIs to combat. We define standard protocols for AIs and the arena to communicate, and we set up a server welcoming AIs from all around the world. Every AI obeying our protocol can combat with our AIs and download detailed statistics. We also provide data and standard utility functions for others to use in training. More details can be found in our GitHub repository.
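The pre-softmax masking discussed above can be expressed in a few lines. The sketch below (plain PyTorch, with our own variable names) contrasts the two options; the variant used in the paper corresponds to masked_before.

```python
import torch
import torch.nn.functional as F

def masked_before(p_logits, legal_mask):
    """Mask the raw outputs before the softmax (the variant used in the paper).
    legal_mask is a 0/1 vector over the 52 actions."""
    return F.softmax(p_logits * legal_mask, dim=-1)

def masked_after(p_logits, legal_mask):
    """Mask after the softmax, then renormalize (the more common variant)."""
    probs = F.softmax(p_logits, dim=-1) * legal_mask
    return probs / probs.sum(dim=-1, keepdim=True)
```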
In section 5.2, we introduced the Winning Point over Mr. Greed (WPG). For this evaluation system to be well-defined, WPG should be transitive, or at least approximately transitive, i.e. a program with a higher WPG should play better. In this section, we introduce a statistic $\epsilon$ that measures the intransitivity of an evaluation function, then show that WPG is nearly transitive by numerical experiments.

We first define $\xi_{ij}$ to be the average winning score of playing strategy $\pi_i$ against strategy $\pi_j$. Let us start by considering two extreme cases. A game is transitive on $\Pi$ if

$$\xi_{ij} + \xi_{jk} = \xi_{ik} \quad \forall\, \pi_i, \pi_j, \pi_k \in \Pi, \qquad (16)$$

where $\Pi$ is a subspace of the strategy space. As a famous example of an intransitive game, the 2-player rock-paper-scissors game is not transitive for the strategy tuple (Always play rock, Always play paper, Always play scissors).

Between the totally transitive and totally intransitive games, we want to build a function $\epsilon_\Pi$ that describes the transitivity of a policy tuple $\Pi = (\pi_1, \pi_2, \dots, \pi_n)$. To better characterize the intransitivity, the function $\epsilon$ should have the following properties:

a) take values inside $[0, 1]$, and equal 0 when the evaluation system is totally transitive and 1 when it is totally intransitive;
b) be invariant under translation and inflation of scores;
c) be invariant under reindexing of the $\pi_i$'s in $\Pi$;
d) make use of the combat results of every triple of strategies $(\pi_i, \pi_j, \pi_k)$ in $\Pi$; there are $C_n^3 = n(n-1)(n-2)/6$ such triples;
e) not degenerate (i.e. approach 0 or 1) under infinite duplication of any $\pi_i \in \Pi$;
f) be stable under adding strategies of a similar level into $\Pi$.

We define $\epsilon$ to be $\epsilon_\Pi = \sum_i \dots$

Table 6. An example of an irrelevant card. Here D4 is the irrelevant card. $\gamma$ is the correction factor of the corresponding cards.

From figure 4, we know that irrelevant cards are most likely to appear in spades. Table 7 gives an example for the irrelevant card SJ. Apart from SJ, other small cards, S2 to S7, are also highly likely to be irrelevant cards in this situation.

Cards Guessed    γ Without SJ    γ With SJ
S5                  0.9424        0.9464
S10                 1.0000        1.0000
SJ                    -           0.9598
S8                  0.9560        0.9585
S10                 1.0000        1.0000
SQ                  0.4818        0.4950
SJ                    -           0.9528
S7                  1.0000        1.0000
SK                  0.8302        0.8331
SJ                    -           0.9677

Table 7. An example of an irrelevant card. Here SJ is the irrelevant card. $\gamma$ is the correction factor of the corresponding cards.

As discussed in section 5.4, as an attention mechanism and to save computational resources, some "important" history slices are selected in the IEC algorithm based on the statistics in section 6. From figures 3 and 4, we can see that cards smaller than 8 always have small correction factors within their suit and low value variance. This means that history slices where the card played is smaller than 8 are highly likely to be unimportant. So we only select history slices where the card played is greater than 7 in the IEC algorithm. Another selection rule is that, when a player is following the suit, history slices of other suits are also unimportant.

In section 5.4, we introduced the IEC algorithm. However, our network's input requires the others' hands. We should not give the hands of a particular scenario directly to the neural network, because the player cannot know the exact hands of the other players. As a workaround, we average the hands information and give it to the neural network, as shown in figure 6. In other words, we replace the "onehot" in the input representing the others' hands with a "triple-1/3-hot".

Figure 6.
The network input representing the others' hands is averaged in IEC. The replacement from "one" to "triple-1/3" is not standard. Table 8 shows that the performance does not deteriorate under this nonstandard input.

Input method                          WPG
Standard input ("onehot")              ±
Averaged input ("triple-1/3-hot")      ±

Table 8. Raw network performance under standard input and averaged input. Raw network means that we directly use the policy network in playing rather than MCTS.

Here we derive equation (15). Once irrelevant cards are defined, $c_{-i}$ is divided into three parts: cards already played, relevant cards, and approximately irrelevant cards. From now on, we refer to approximately irrelevant cards simply as irrelevant cards. The $J_j$ in equation (14) is the sum of the correction factors of all irrelevant cards held by player $j$, while $Y_t$ in equation (14) is the sum of the correction factors of the other cards of the player who played at stage $t$:

$$\sum_{c_k \in \text{irrelevant cards of player } j} \gamma(c_k) \triangleq J_j, \qquad \sum_{c_k \in \text{cards played} \,\cup\, \text{relevant cards}} \gamma(c_k) \triangleq Y_t \qquad (21)$$

Notice that, by the definition of irrelevant cards, $\sum_{j=1}^{3} J_j$ remains unchanged across the different scenarios at a given decision node; we denote it by $J$:

$$J_1 + J_2 + J_3 = \text{Const} \triangleq J. \qquad (22)$$

The distribution of $J_j$ is multinomial. In most situations, there are many (5 or more) irrelevant cards. By the central limit theorem, a multinomial distribution can be approximated by a multivariate normal distribution. Since $J_3 = J - J_1 - J_2$, we can derive the marginal distribution of $J_1$ and $J_2$. We adopt this approximation and replace the summation over all permutations of irrelevant cards in equation (14) by an integral:

$$p(h_u \mid c_{-i}) \approx \frac{3^N}{2\pi |\Sigma|^{1/2}} \iint \exp\Bigl(-\tfrac{1}{2}(x_1, x_2)\, \Sigma^{-1} (x_1, x_2)^{T}\Bigr) \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{Y_{t+1} + J/3 + x_{j(t+1)}}\, \mathrm{d}x_1\, \mathrm{d}x_2 = 3^N \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{Y_{t+1} + J/3}\; O(\xi) \qquad (23)$$

where $N$ is the number of irrelevant cards, $J$ the sum of the correction factors of all irrelevant cards, $x_j = J_j - J/3$, and $\xi$ a bounded quantity. The last equality holds because $\Sigma$ satisfies

$$\frac{1}{2\pi |\Sigma|^{1/2}} \iint_{x_1 + x_2 < J,\; x_1, x_2 > -J} e^{-\frac{1}{2}(x_1, x_2)\, \Sigma^{-1} (x_1, x_2)^{T}}\, \mathrm{d}x_1\, \mathrm{d}x_2 = 1 - O(\xi).$$