ScrofaZero: Mastering Trick-taking Poker Game Gongzhu by Deep Reinforcement Learning
Naichen Shi * 1
Ruichen Li * 2
Sun Youran * 3
Abstract
People have made remarkable progress in game AIs, especially in the domain of perfect information games. However, trick-taking poker games, a popular form of imperfect information game, have long been regarded as a challenge. Since a trick-taking game requires a high level of not only reasoning but also inference to excel, it can be a new milestone for imperfect information game AI. We study Gongzhu, a trick-taking game analogous to, but slightly simpler than, contract bridge. Nonetheless, the strategies of Gongzhu are complex enough for both human and computer players. We train a strong Gongzhu AI, ScrofaZero, from tabula rasa by deep reinforcement learning, while few previous efforts on solving trick-taking poker games utilize the representation power of neural networks. Also, we introduce new techniques for imperfect information games including stratified sampling, importance weighting, integral over equivalent class, and Bayesian inference. Our AI achieves human expert level performance. The methodologies used in building our program can be easily transferred to a wide range of trick-taking games.
1. Introduction
We live in a world full of precariousness. Like a famous quotation from Sherlock Holmes, "I have a turn both for observation and for deduction." (Doyle), one should deduce hidden information from seemingly random observations to make good decisions. Imperfect information games are an abstraction of multi-agent decision making with private information. Related theory has been found useful in auctions, mechanism design, etc. (Tadelis, 2013). The study of specific examples of imperfect information games can strengthen our abilities to navigate through the uncertainties of the world.

*Equal contribution. 1 IOE, University of Michigan. 2 EECS, Peking University. 3 Yau Mathematical Sciences Center, Tsinghua University. Correspondence to: Sun Youran <[email protected]>.

People have successfully built superhuman AIs for perfect information games including Go (Silver et al., 2017) and chess (Silver et al., 2018) by using deep reinforcement learning. Also, by combining deep learning with imperfect sampling, researchers have made huge progress in Mahjong (Li et al., 2020), StarCraft (Vinyals et al., 2019), Dota (OpenAI et al., 2019), and Texas hold'em (Brown et al., 2019).

We study Gongzhu, a 4-player imperfect information poker game. Gongzhu is tightly connected with a wide range of trick-taking games. The detailed rules are introduced in section 2. Building a strong Gongzhu program can deepen our understanding of imperfect information games.

We study Gongzhu for three reasons. Firstly, Gongzhu contains a medium level of randomness and requires careful calculation to reign supreme. Compared with Mahjong and Texas hold'em, Gongzhu is more complicated since its decision space is larger, let alone the toy poker games pervasively studied in the literature such as Leduc (Southey et al., 2005) and Kuhn (Kuhn, 1950). What is more, it is important to read the signals from the history of other players' actions and continuously update beliefs about their private information. The entanglement of sequential decision making and imperfect information makes it extremely nontrivial to find a good strategy out of a high degree of noise.

Secondly, the scoring system of Gongzhu is relatively simple compared with bridge, since players do not bid in Gongzhu. Thus the reward function is easier to design, and we can focus on training a highly skilled playing AI given such a reward function.

Thirdly, compared with large scale games like StarCraft or Dota, Gongzhu is more computationally manageable: all experiments in this paper can be done on only 2 Nvidia 2080Ti GPUs.

We train a Gongzhu program from tabula rasa by self-play without any prior human knowledge other than the game rules. Our algorithm is a combination of Monte Carlo tree search and Bayesian inference, and it extends MCTS to imperfect information games. Our program defeats expert level human Gongzhu players on our online platform Gongzhu Online.

We summarize our contributions below:

• We introduce the game of Gongzhu, which is more difficult than Leduc but more manageable than StarCraft. Gongzhu can be a benchmark for different multi-agent reinforcement learning algorithms.

• We train a strong Gongzhu agent, ScrofaZero, purely by self-play. The training of ScrofaZero requires neither human expert data nor human guidance beyond the game rules.

• To the best of our knowledge, we are the first to combine Bayesian inferred importance sampling with deep neural networks for solving trick-taking games. Our methods can be transferred to other trick-taking games including contract bridge.

The paper is organized as follows. In section 2, we review the rules of Gongzhu and its connection to other trick-taking games. In section 3, we discuss related work in the literature. In section 4, we present an overview of our framework. In section 5, we present our key methodologies, including stratified sampling and integral over equivalent class. In section 6, we analyze the results of extensive empirical experiments.
2. Rules of Gongzhu and Related Games
Before delving further into our program, we use this section to familiarize readers with the rules of Gongzhu. We introduce the rules from general trick-taking poker games down to the specific game of Gongzhu.

Gongzhu belongs to the class of trick-taking games, a large set of games that includes bridge, Hearts, Gongzhu, and Shengji. For readers unfamiliar with any of these games, see supplementary material 8.1 for the common rules of trick-taking games. From now on, we assume readers understand the concept of a trick.

The class trick-taking is divided into two major families according to the goal of the games: the family plain-trick and the family point-trick. In the family plain-trick, which includes Whist, Contract bridge, and Spades, the goal is to win specific tricks or as many tricks as possible. In the family point-trick, which includes Black Lady, Hearts, Gongzhu, and Shengji, the goal is to maximize the total points of the cards obtained.

For most games in the family point-trick, only some cards are associated with points. Depending on the point counting system, the family point-trick is further subdivided into two genera, evasion and attack. In the genus evasion, most points are negative, so the strategy is usually to avoid winning tricks, while the genus attack is the opposite. Gongzhu belongs to the genus evasion.

Figure 1. The classification of trick-taking games. Gongzhu belongs to the genus evasion, the family point-trick. As shown in this figure, Gongzhu is tightly connected with a wide variety of games.

The points in Gongzhu are counted by the following rules.

1. Every heart is associated with points. Their points are shown in Table 1. Notice that
   • all points of hearts are non-positive;
   • the higher the rank of a heart card, the greater the absolute value of its points;
   • the total points of hearts are -200.
2. SQ has -100 points. SQ is called zhu (scrofa) in this game. As zhu has the most negative points, players will try their best to avoid getting it.
3. DJ has +100 points. DJ is called yang (sheep/goat), in contrast with zhu.
4. C10 doubles the points. However, if one gets only C10, it is counted as +50. C10 is called the transformer for obvious reasons.
5. If one player collects all 13 hearts, to reward her braveness and skill, the points of all the hearts are counted as +200 rather than -200. This is called all hearts. It is worth clarifying that
   • to get all hearts, one needs to collect all 13 hearts, including the zero-point hearts H2, H3, and H4;
   • the points are counted separately for each player and then summed within each team.

All the rules except all hearts are summarized in Table 1. The classification of trick-taking games and where Gongzhu belongs are shown in Figure 1.

HA          -50        SQ    -100
HK          -40        DJ    +100
HQ          -30        C10   +50 or double the points
HJ          -20
H5 to H10   -10
H2 to H4      0
Table 1.
Points of cards in Gongzhu. Gongzhu is a trick-taking class, point-trick family, evasion genus game. Note that in addition to the points shown in the above table, there is an extra all hearts rule explained in Section 2.
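To make the scoring rules above concrete, the following is a minimal Python sketch of how a single player's score could be computed from the point cards she has collected. The card encoding (strings like 'H5' or 'SQ') and the function name are our own illustrative choices, not part of the original program.

```python
# Hypothetical card encoding: 'H2'..'HA' for hearts, plus 'SQ', 'DJ', 'C10'.
HEART_POINTS = {'HA': -50, 'HK': -40, 'HQ': -30, 'HJ': -20,
                'H10': -10, 'H9': -10, 'H8': -10, 'H7': -10,
                'H6': -10, 'H5': -10, 'H4': 0, 'H3': 0, 'H2': 0}

def gongzhu_score(collected):
    """Score of one player's collected point cards (a set of card strings)."""
    hearts = [c for c in collected if c in HEART_POINTS]
    if len(hearts) == 13:                  # the "all hearts" rule
        score = +200
    else:
        score = sum(HEART_POINTS[c] for c in hearts)
    if 'SQ' in collected:                  # zhu (scrofa)
        score += -100
    if 'DJ' in collected:                  # yang (sheep/goat)
        score += +100
    if 'C10' in collected:                 # transformer
        if len(collected) == 1:            # only C10: counted as +50
            score = 50
        else:
            score *= 2                     # otherwise it doubles the points
    return score
```

For example, gongzhu_score({'SQ', 'HA', 'C10'}) evaluates to (-100 - 50) * 2 = -300.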
3. Literature Review
Different from Go, Gongzhu is a dynamic incomplete information game (Tadelis, 2013). Research on the dynamics of imperfect information games and on equilibrium concepts has a long history; for example, (Selten, 1975) introduced a more refined equilibrium concept called trembling hand equilibrium. Our work applies such theoretical analysis, including Bayesian inference and sequential decision making, to the trick-taking game Gongzhu. Recently, people proposed counterfactual regret minimization related algorithms (Zinkevich et al., 2007; Brown & Sandholm, 2019) that can be proved to find an ε-Nash equilibrium. The application of counterfactual regret minimization type algorithms has been found successful in Texas hold'em (Brown et al., 2019). Such algorithms are not directly applicable to Gongzhu, which has a larger decision space and a longer decision period.
As shown in section 2, Gongzhu is tightly connected with bridge. However, unlike chess and Go, computer bridge programs cannot beat human experts yet. The recent winners of the World Computer-Bridge Championship (Wikipedia, 2021) are Wbridge5 (in 2018) and Micro Bridge (in 2019; the 2020 championship was cancelled, and the next one will be held in 2021). Micro Bridge first randomly generates unknown hands under known conditions derived from the history, then applies tree search and pruning algorithms to make decisions under perfect information. Wbridge5 does not reveal its algorithm to the public, but it is believed to be similar to Micro Bridge, i.e. human-crafted rules for bidding and heuristic tree search algorithms for playing. Different from these works, we use a deep neural network to evaluate the current situation and to generate actions. Recently, (Rong et al., 2019) built a bidding program first by supervised learning and then by reinforcement learning. In contrast, we train our Gongzhu program from tabula rasa.

The use of Monte Carlo tree search both as a good rollout algorithm and as a stable policy improvement operator is popular in perfect information games like Go (Silver et al., 2017) and chess (Silver et al., 2018). (Grill et al., 2020) analyzed some theoretical properties of the MCTS used in AlphaGo Zero. (Whitehouse, 2004; Browne et al., 2012) discuss popular MCTS algorithms, including information set Monte Carlo tree search (ISMCTS), an algorithm that combines MCTS with incomplete information. The Monte Carlo tree search algorithm used in our program is the standard upper confidence bound version, and it is computationally simpler than the full ISMCTS.
The dynamics of training multi-player game AIs by self-play can be complicated. People have found counter-examples where almost all gradient-based algorithms fail to converge to a Nash equilibrium (Letcher, 2021). Also, some games are nontransitive for certain strategies (Czarnecki et al., 2020). Instead of a Nash equilibrium strategy, we attempt to find a strong strategy that can beat human expert level players. We also define a metric for the nontransitivity in the game. Different from (Letcher et al., 2019), where nontransitivity is defined only in the parameter space, our metric can be defined for any strategy.
4. Framework
This section is an overview of our program. We start by introducing some notation. As discussed above, Gongzhu is a sequential game with incomplete information. The players are denoted as $N = \{0, 1, 2, 3\}$. We denote the integer $u \in [0, 52]$ as the stage of the game, which can also be understood as how many cards have been played. We define a history $h_u \in \mathcal{H}$ to be a sequence of tuples $\{(i_t, a_t)\}_{t=1,\dots,u}$, where $t$ is the stage, $i_t$ is the player who takes an action at stage $t$, and $a_t$ is the card she played. A history represents the card playing up to time $u$. We sometimes write $h_u$ as $h$ for simplicity, and use $h(t)$ to denote the $t$-th tuple in $h$. The history $h$ is publicly observable to all players. It is natural to assume that each player has perfect recall, i.e. they can remember the history exactly. After the dealing of each round, players have their initial hand cards $c = \{c_i\}_{i=0,1,2,3}$. $c_i$ is private information of player $i$, and the other players cannot see it. Also, we denote by $c_i^u$ the $i$-th player's remaining hand cards at stage $u$. An action $a \in \mathcal{A}$ is the card to play, and $|\mathcal{A}| = 52$. By the rules of Gongzhu, a player may not be able to play all cards depending on the suit of that trick, so the actual set of legal choices of $a$ may be smaller than the set of remaining cards in one's hand.

An information set is a concept from incomplete information game theory that, roughly speaking, characterizes the information available for decision making. For the standard Gongzhu game, we define the information set of player $i$ at time $u$ to be $(h_u, c_i)$, i.e. the public information $h_u$ combined with player $i$'s private information $c_i$. The payoff $r$ is calculated only at the end of the game, i.e. when all 52 cards are played: $r(h_T) = (r_0(h_T), r_1(h_T), r_2(h_T), r_3(h_T))$ (where $T = 52$) represents the scores of the four players. A strategy of player $i$ is a map from history and private information to a probability distribution over the action space, $\pi_i: \mathcal{H} \times \mathcal{C}_i \to \Delta(\mathcal{A})$, i.e. given a specific history $h$ and initial cards $c_i$, $\pi_i(c_i, h)$ chooses a card to play. A strategy profile $\pi$ is the tuple of the 4 players' strategies, $\pi = (\pi_0, \pi_1, \pi_2, \pi_3)$. We use $\pi_{-i}$ to denote the strategies of the players other than $i$.

The value of an information set from the perspective of player $i$ is

$$v_i(\pi_i, \pi_{-i}, c_i, h) = \mathbb{E}_{p(c_{-i} \mid h, \pi)}\bigl[\mathbb{E}_{\pi}\bigl[r_i(h_T)\bigr]\bigr] = \mathbb{E}_{p(c_{-i} \mid h, \pi)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \qquad (1)$$

We write $v_{\pi}(h, c_i, c_{-i})$ in place of $\mathbb{E}_{\pi}[r_i(h_T)]$ for simplicity. $v_{\pi}(h, c_i, c_{-i})$ can be interpreted as the value of a state under perfect information, where $h_T$ ranges over the possible terminal states starting from $h$ with initial hands $c$. The inner expectation is taken in the following sense. Suppose there exists an oracle that knows exactly each player's initial cards $c$ and each player's strategy $\pi$; it plays on each player's behalf with her strategy starting from $h$ until the end of the game. Due to the randomness of mixed strategies, the outcome, and thus the payoff of the game, is also random. The inner expectation is taken over the randomness of the game trajectory resulting from the mixed strategies.

The outer expectation is taken over player $i$'s belief. We define a scenario to be one possible initial hand configuration $c_{-i}$ from the perspective of player $i$. In Gongzhu, a player's belief is the probability $p(c_{-i} \mid h, \pi)$ she assigns to each possible scenario.
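As a concrete illustration of this notation, the sketch below shows one possible Python representation of a history and an information set; the class and field names are ours and are not taken from the ScrofaZero code base.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Set

Card = str          # e.g. 'SQ', 'H5'; 52 cards in total
Player = int        # players are 0, 1, 2, 3

@dataclass
class History:
    """Public history h_u: the sequence of (player, card) tuples played so far."""
    plays: List[Tuple[Player, Card]] = field(default_factory=list)

    @property
    def stage(self) -> int:
        """u = number of cards already played."""
        return len(self.plays)

@dataclass
class InformationSet:
    """What player i actually observes at stage u: (h_u, c_i)."""
    player: Player
    history: History            # h_u, observable by everyone
    hand: Set[Card]             # c_i^u, private remaining cards of player i
```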
At the beginning of the game, all possible scenarios have equal probability. This is the case when $h = h_0 = \emptyset$, which is usually referred to as a common prior by game theorists. As the game proceeds, every player can see what has been played by the other players, and this changes her belief about the probability of different scenarios. For example, initial hand configurations that are not compatible with the rules of Gongzhu clearly have probability 0. A natural way of updating beliefs is Bayes' rule. We cover more details on calculating the outer expectation in sections 5.3 and 5.4.

A player's goal is to optimize her strategy at every decision node, assuming that she knows the other players' strategies $\pi_{-i}$:

$$\max_{\pi_i} v_i(\pi_i, \pi_{-i}, c_i, h) \qquad (2)$$

She chooses an action at every decision node to maximize her expected payoff. Gongzhu ends after all 52 cards are played. The extensive form of the game can be regarded as a $4 \times 13$-layer decision tree. Thus, in principle, a perfect Bayesian equilibrium (Tadelis, 2013) can be solved by backward induction. However, we do not solve for an exact perfect Bayesian equilibrium here because (i) compared with obtaining an equilibrium policy that is unexploitable by others, it is more useful to obtain a policy that can beat most other high level players, and (ii) an exact perfect Bayesian equilibrium is computationally infeasible.

To obtain a strong policy, we train a neural network by self-play. Our neural network has a fully connected structure (see supplementary material 8.3). We divide training and testing into two separate processes. In training, we train this neural network to excel under perfect information, and in testing, we try to replicate the performance of the model under perfect information by adding the Bayesian inference described in sections 5.3 and 5.4.

To train the neural network, we assume each player knows not only her own initial hand, but also the initial hands of the other players (see supplementary material 8.7). In the terminology of computer bridge, this is called double dummy. Then the outer expectation in equation (1) can be removed, since each player knows exactly what the other players have in their hands at any time of the game. The use of perfect information in training has two benefits: firstly, the randomness of the hidden information is eliminated, so the training of the neural network becomes more stable; secondly, since sampling hidden information is time consuming, using perfect information saves time. Although this treatment may downplay the use of strategies like bluffing in actual play, the trained network performs well. Inspired by AlphaGo Zero (Silver et al., 2017), we use Monte Carlo tree search as a policy improvement operator to train the neural network (see supplementary material 8.6). The loss function is defined as

$$\ell = D_{\mathrm{KL}}(p_{\mathrm{MCTS}} \,\|\, p_{\mathrm{nn}}) + \lambda \, |v_{\mathrm{MCTS}} - v_{\mathrm{nn}}| \qquad (3)$$

where $p_{\mathrm{MCTS}}$ and $v_{\mathrm{MCTS}}$ are the policy and value of a node returned by Monte Carlo tree search from that node, and $p_{\mathrm{nn}}$ and $v_{\mathrm{nn}}$ are the outputs of our neural network. The parameter $\lambda$ weights the value loss against the policy loss. Since the value of a specific node can sometimes be as large as several hundred while the KL divergence is at most 2 or 3, this parameter is necessary. For more details on training, see supplementary material 8.8.
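The loss in equation (3) can be written down directly. Below is a minimal PyTorch sketch, assuming p_nn is the network's probability vector over the 52 actions and v_nn its scalar value head; the variable names and the default value of lam are our own illustrative choices.

```python
import torch

def scrofa_loss(p_nn, v_nn, p_mcts, v_mcts, lam=0.01):
    """Loss of equation (3): KL(p_MCTS || p_nn) + lambda * |v_MCTS - v_nn|.

    p_nn, p_mcts: probability vectors over the 52 actions (each sums to 1).
    v_nn, v_mcts: scalar value estimates of the current node.
    lam: the weighting parameter lambda (the value here is illustrative).
    """
    eps = 1e-12  # avoid log(0)
    kl = torch.sum(p_mcts * (torch.log(p_mcts + eps) - torch.log(p_nn + eps)))
    return kl + lam * torch.abs(v_mcts - v_nn)
```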
For actual testing or benchmarking, however, we must comply with the rules and play honestly. On the other hand, the use of Monte Carlo tree search requires knowledge of each player's hand in order for each player to perform tree search. To bridge the gap, we use stratified and importance sampling to estimate the outer expectation of equation (1). We sample $N$ scenarios by our stratified sampling, then use MCTS with our policy network as the default value estimator to calculate the Q value of each choice. We then average these Q values with an importance weight over the hidden information. Finally, we choose the card with the highest averaged Q value to play. Details of stratified and importance sampling are discussed in sections 5.3 and 5.4. In figure 2, we can see that the neural network improves itself steadily.

Figure 2.
AI trained with perfect information and tested in the standard Gongzhu game. Testing scores are calculated by the WPG described in section 5.2. Raw network means that we use MCTS for one step, with the value network as the default evaluator. Mr. Random, Mr. If and Mr. Greed are three human experience AIs described in section 5.1. Every epoch takes less than one minute on a single Nvidia 2080Ti GPU.
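The test-time decision procedure described above can be summarized in a few lines of Python. The helper functions (sample_scenarios, scenario_weight, mcts_q_values) stand in for the stratified sampler, the IEC weighting of section 5.4, and the MCTS rollout respectively; they are placeholders, not functions from the released code.

```python
import numpy as np

def choose_card(info_set, legal_cards, n_scenarios, sample_scenarios,
                scenario_weight, mcts_q_values):
    """Pick the card with the highest importance-weighted average Q value."""
    scenarios = sample_scenarios(info_set, n_scenarios)    # stratified sampling
    weights = np.array([scenario_weight(info_set, s) for s in scenarios])
    weights = weights / weights.sum()                       # normalize importance weights
    avg_q = {card: 0.0 for card in legal_cards}
    for w, scenario in zip(weights, scenarios):
        q = mcts_q_values(info_set, scenario, legal_cards)  # dict: card -> Q under this deal
        for card in legal_cards:
            avg_q[card] += w * q[card]
    return max(avg_q, key=avg_q.get)
```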
5. Methodology
To expand the strategy space, we build a group of human experience based AIs using standard methods. We name them Mr. Random, Mr. If, and Mr. Greed. Among them, Mr. Greed is the strongest. The performance of these AIs is shown in table 2. More details can be found in supplementary material 8.2.
We evaluate the performance of our AIs by letting two copies of an AI team up as partners and play against two Mr. Greeds, and calculating their average winning points. We call this the Winning Point over Mr. Greed (WPG). In a typical evaluation, we run a large number of games. We show that this evaluation system is well-defined in supplementary material 8.10.

        R      I      G     SZS
R       0     194    275    319
I     -194     0      80    127
G     -275   -80      0      60
SZS   -319  -127    -60       0

Table 2.
Combat results for 4 different AIs. In this table, R stands for Mr. Random, I for Mr. If, G for Mr. Greed, and SZS for ScrofaZeroSimple. R, I, and G are human experience AIs described in Section 5.1 and supplementary material 8.2. SZS is ScrofaZero without the IEC algorithm described in section 5.4.
Given a history $h$ and the initial cards $c_i$ in player $i$'s hand, one should estimate what cards the other players have in their hands. As discussed before, we denote this by $c_{-i}$ and use $c_{-i}$ and scenario interchangeably. We use $\mathcal{C}(c_i)$ to denote the set of all possible $c_{-i}$'s.

The most natural way to calculate one's belief about the distribution of scenarios in $\mathcal{C}(c_i)$ is Bayesian inference (Tadelis, 2013). From the Bayesian viewpoint, if the players' strategy profile is $\pi_{i=0,1,2,3}$, which we now assume to be common knowledge, player $i$'s belief after observing history $h$ that the initial cards in the other players' hands are $c_{-i}$ is

$$p(c_{-i} \mid h) = \frac{p(h \mid c_{-i}; \pi)\, p(c_{-i})}{\sum_{e \in \mathcal{C}(c_i)} p(h \mid e; \pi)\, p(e)} \qquad (4)$$

where $p(h \mid c_{-i}; \pi)$ is the probability that history $h$ is generated if the initial card configuration is $c_{-i}$ and the players play according to the strategy profile $\pi_{i=0,1,2,3}$. We omit the dependence on $\pi$ when there is no confusion.

The belief is important because players use it to calculate the outer expectation in equation (1),

$$\mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \qquad (5)$$

where $v_{\pi}(h, c_i, c_{-i})$ is the value function. The exact calculation of (5) through (4) requires enumerating all possible configurations in the set $\mathcal{C}(c_i)$, which can contain an astronomically large number of elements. Such a size is computationally intractable, so we seek to estimate it by Monte Carlo sampling. A naive application of Monte Carlo sampling can bring large variance to the estimation, so we derive a stratified importance sampling approach to obtain high quality samples.

In trick-taking games, there are always cases where some key cards are much more important than the others. These cards are usually associated with high variance in Q value; see section 6.2. This is especially true for Gongzhu. We use stratified sampling to exhaust all possible configurations of the important cards.

More specifically, we first divide the entire $\mathcal{C}(c_i)$ into several mutually exclusive strata $\{S_1, S_2, \dots, S_t\}$ such that $\mathcal{C}(c_i) = \bigcup_{j=1}^{t} S_j$. Each stratum represents one configuration of the important cards. To generate the partition $\{S_1, S_2, \dots, S_t\}$, we identify the key cards $\{c_{k_1}, c_{k_2}, \dots, c_{k_q}\}$ in $c_{-i}$ based on the statistics of the trained neural network (see section 6.2), then exhaust all possible configurations of $\{c_{k_1}, c_{k_2}, \dots, c_{k_q}\}$. After obtaining the partition, we sample inside each stratum. More formally, by the law of total expectation, we can rewrite equation (5) as

$$\sum_{j=1}^{t} p(S_j)\, \mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h, S_j)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \qquad (6)$$

where $p(S_j)$ is the probability that $c_{-i}$ is in stratum $S_j$, i.e. $p(S_j) = \mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[\mathbb{1}\{c_{-i} \in S_j\}\bigr]$, and $p(c_{-i} \mid h, S_j)$ is the probability distribution of $c_{-i}$ given the history $h$ and the fact that $c_{-i}$ is in stratum $S_j$. As a zeroth order approximation, we set $p(S_j) = 1/t$ for all $j$.

Since the expectation in equation (6) is still analytically intractable, we employ importance sampling to bypass the problem. If we can obtain samples from a simpler distribution $q(c_{-i})$ that has common support with $p(c_{-i} \mid h)$, then by the Radon-Nikodym theorem

$$\mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] = \mathbb{E}_{c_{-i} \sim q(c_{-i})}\Bigl[v_{\pi}(h, c_i, c_{-i})\, \frac{p(c_{-i} \mid h)}{q(c_{-i})}\Bigr] \qquad (7)$$

where we call the term $\frac{p(c_{-i} \mid h)}{q(c_{-i})}$ the posterior distribution correction.
If we draw $N$ samples $C_N = \{c_{-i}^{(1)}, c_{-i}^{(2)}, \dots, c_{-i}^{(N)}\}$ from $q(c_{-i})$:

$$\mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \approx \frac{1}{N} \sum_{k=1}^{N} v_{\pi}(h, c_i, c_{-i}^{(k)})\, \frac{p(c_{-i}^{(k)} \mid h)}{q(c_{-i}^{(k)})} \qquad (8)$$

We take $q(c_{-i})$ to be the following distribution:

$$q(c_{-i}) = \begin{cases} 1 / |\mathcal{C}(c_i)| & \text{if } c_{-i} \text{ is compatible with the history} \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

i.e. $q(c_{-i})$ is a uniform distribution over all $c_{-i}$ that are compatible with the history. Compatible with the history means that under such a configuration the actions in history $h$ do not violate any rules.

Since the ratio $\frac{p(c_{-i} \mid h)}{q(c_{-i})}$ is still intractable, we use

$$\hat{p}(c_{-i}^{(k)} \mid h) = \frac{p(h \mid c_{-i}^{(k)})\, p(c_{-i}^{(k)})}{\sum_{l=1}^{N} p(h \mid c_{-i}^{(l)})\, p(c_{-i}^{(l)})}, \qquad \hat{q}(c_{-i}) = \frac{1}{N} \qquad (10)$$

to approximate $p(c_{-i}^{(k)} \mid h)$ and $q(c_{-i}^{(k)})$. Equation (10) changes the scope of the summation in the denominator of (4) from the entire population to only the samples. Then equation (7) reduces to

$$\mathbb{E}_{c_{-i} \sim p(c_{-i} \mid h)}\bigl[v_{\pi}(h, c_i, c_{-i})\bigr] \approx \frac{1}{\sum_{k=1}^{N} s(c_{-i}^{(k)})} \sum_{l=1}^{N} v_{\pi}(h, c_i, c_{-i}^{(l)})\, s(c_{-i}^{(l)}) \qquad (11)$$

where $s(c_{-i}^{(k)})$ is the score we assign to scenario $c_{-i}^{(k)}$, defined as $s(c_{-i}^{(k)}) = p(h \mid c_{-i}^{(k)})\, p(c_{-i}^{(k)})$. We introduce an algorithm to calculate this score in section 5.4.

In this section, we focus on how to compute $s(c_{-i})$. We assume that the other players use strategies similar to player $i$'s. Then the policy network of ScrofaZero can be used to estimate $p(h \mid c_{-i})$. To continue, we define the correction factor $\gamma$ of a single action as

$$\gamma(a, h, c_j) = e^{-\beta \cdot \mathrm{regret}} = e^{-\beta (q_{\max} - q_a)}, \qquad (12)$$

which is the unnormalized probability of player $j$ taking action $a$ under the assumption that $j$ uses a strategy similar to player $i$'s. In definition (12), $h$ is the history before action $a$, $c_j$ the hand cards of player $j$, $q_a$ the policy network output for player $j$'s action $a$, $q_{\max}$ the greatest output among the legal choices in $c_j$, and $\beta$ a temperature controlling the level of certainty of our belief. Then the $p(h \mid c_{-i})$ in formula (4) can be written as

$$p(h \mid c_{-i}) = p(h_u \mid c_{-i}) = \prod_{t=0}^{u-1} p(a_{t+1} \mid h_t, c_{j(t+1)}) = \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{\sum_{\alpha \in \mathrm{lc}(t)} \gamma(\alpha, h_t, c_{j(t+1)})}, \qquad (13)$$

where $\mathrm{lc}(t)$ is the set of legal choices at stage $t$. As a generalized Bayesian treatment, we estimate $p(h \mid c_{-i})$ with products of correction factors. We call this algorithm Integral over Equivalent Class (IEC). The pseudocode for IEC is given in algorithm 1. As an attention mechanism and to save computational resources, some "important" history slices are selected, based on the statistics in section 6, when calculating a scenario's score in algorithm 1; see supplementary material 8.12 for details.

Compared with naive Bayes weighting, our IEC weighting is insensitive to variations in the number of legal choices and thus more stable. Experiments show that IEC can outperform naive Bayes weighting by a large margin, see table 3.

In the rest of this section we explain the intuition behind integral over equivalent class. We begin by introducing the concept of irrelevant cards. Irrelevant cards are cards which (i) will not change their own correction factor or other cards' correction factors if they are moved to another player's hand, and (ii) will not change their correction factor if other cards are moved.
Algorithm 1
Integral over Equivalent Class (IEC)
Input: history $h_u$, player $i$'s initial cards $c_i$, one possible scenario $c_{-i}$.
$s(c_{-i}) \leftarrow 1$
for $t = u-1, u-2, \dots, 0$ do
    $h \leftarrow h_t$, $c \leftarrow c_{j(t+1)}$, $a \leftarrow a_{t+1}$
    if $a$ is important then
        $s(c_{-i}) \leftarrow \gamma(a, h, c)\, s(c_{-i})$
    end if
end for
Output: score $s(c_{-i})$ for scenario $c_{-i}$.

Techniques                   Performance   Win(+Draw) Rate
US                               ±
SS                               ±
US with IEC                      ±               -
SS with IEC                      ±
SS with IEC (against US)         ±

Table 3.
Performance with different methods. US stands for Uniform Sampling, SS for Stratified Sampling, and IEC for Integral over Equivalent Class. The sampling number of US is set to 9 so that it equals the sampling number of SS. The last line of this table is ScrofaZero with the strongest sampling technique, SS with IEC, playing against itself without any of these methods.

The existence of approximately irrelevant cards can be confirmed both from the experience of playing games and from the statistics in section 6. In figure 4 of section 6.1, we see that there are some cards whose variance of values is small. These cards are candidates for approximately irrelevant cards. See supplementary material 8.11 for a concrete example.

We call two distributions of cards that differ only in irrelevant cards equivalent. This equivalence relation divides all scenarios $\mathcal{C}(c_i)$ into equivalence classes. We denote the equivalence class of scenario $c_{-i}$ by $[c_{-i}]$. We should integrate over the whole equivalence class once we obtain the result of one representative element, because the MCTS procedure for each scenario is expensive. The weight of one equivalence class should be

$$p(h_u \mid c_{-i})\, p([c_{-i}]) = \sum_{\substack{\text{all permutations of} \\ \text{irrelevant cards}}} \; \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{Y_{t+1} + J_{j(t+1)}} \qquad (14)$$

where $j(t)$ is the player who played at stage $t$, $J_{j(t)}$ the sum of the correction factors of the irrelevant cards held by player $j(t)$, and $Y_t$ the sum of the correction factors of the other cards held by that player. $Y$ may change in different scenarios, but $\sum_{j=1}^{3} J_j$ remains unchanged by definition; we denote it by $J$.

Following the steps in supplementary material 8.14, we can evaluate the summation in equation (14):

$$p(h_u \mid c_{-i})\, p([c_{-i}]) = 3^N \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{Y_{t+1} + J/3}\; O(\xi) \qquad (15)$$

where $N$ is the number of irrelevant cards, $J$ the sum of the correction factors of all irrelevant cards, and $\xi$ a bounded real number; see supplementary material 8.14 for details.

Notice that the denominators in the result of (15) are insensitive to changes in $Y$, because both $Y$ and $J$ are always greater than 1 (see section 6.1 for the magnitudes of $Y$ and $J$). For scenarios in different equivalence classes, the $Y$'s might differ, but $J$ is always the same. So the integral remains approximately the same. Thus we can ignore the denominators when calculating a scenario's score, or in other words, we can use the product of unnormalized correction factors as the scenario's score. This is exactly the procedure of IEC.
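A compact Python sketch of algorithm 1, and of how the resulting scores enter the importance-weighted estimate of equation (11), is given below. The correction-factor function gamma, the importance test, and the scenario accessor are placeholders to be backed by the trained policy network; the names are ours.

```python
import numpy as np

def iec_score(history, scenario, gamma, is_important):
    """Algorithm 1: product of correction factors over the important history slices."""
    score = 1.0
    for t in reversed(range(len(history))):        # t = u-1, ..., 0
        player, action = history[t]                # the (player, card) tuple at stage t+1
        hand = scenario.hand_at(player, t)         # that player's cards before the action (placeholder)
        if is_important(action, history[:t]):
            score *= gamma(action, history[:t], hand)
    return score

def weighted_value(history, scenarios, values, gamma, is_important):
    """Self-normalized importance-weighted estimate of equation (11)."""
    scores = np.array([iec_score(history, s, gamma, is_important) for s in scenarios])
    return float(np.dot(scores, values) / scores.sum())
```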
6. Empirical Analysis
To begin with, we present some basic statistics of the neural network, including the mean and variance of the value and of the correction factor $\gamma$ of playing a card. The average correction factor $\gamma$ shown in figure 3 reflects whether a card is "good" or not: the higher the correction factor, the better it is to take that action. We can see in the figure that SQ is not a "good" card; one had better not play it. Another finding is that the correction factors of all four suits peak at around 10, so when a player chooses among cards lower than 10, she should play the largest one.

However, in Gongzhu the value of a card depends heavily on the situation. Hence it is important to study the variance of the correction factor. For example, SQ brings large negative points if it ends up in your team's tricks, but large profits if it ends up in your opponents'. The variance of values shown in figure 4 illustrates the magnitude of risk when dealing with the corresponding cards. We can see that SQ's variance is large, which is in line with our analysis. Meanwhile, heart cards should be handled differently in different situations. Within the heart suit, HK and HA are especially important. This may be the result of the finesse technique and the all hearts rule in Gongzhu.

These statistics from ScrofaZero reveal which cards are important. This information is used in the stratified sampling of section 5.3. Also, they are consistent with human experience.
Average value of the correction factor for different cards. The temperature $\beta$ used here is fixed to a constant value.
Variance of values for different cards.
The best classical AI, Mr. Greed, explained in Section 5.1, has many fine-tuned parameters, including a value for each card. For example, although SA and SK are not directly associated with points in the rules of Gongzhu, they have a great chance of capturing the most negative card SQ, which weighs -100. So SA and SK are counted as negative points in Mr. Greed. These parameters are fine-tuned by human experience. In the area of chess, people have compared the difference in chess piece relative values between human convention and a deep neural network AI trained from zero (Tomašev et al., 2020). Here we conduct a similar analysis of our neural network AI and Mr. Greed's parameters. Table 4 shows the experience parameters in Mr. Greed for some important cards and ScrofaZero's output under typical situations. Negative means that the card is a risk or hazard, so it is better to get rid of it, while a positive value has the opposite meaning. We can see that Mr. Greed and ScrofaZero agree with each other very well.

Cards   Mr. Greed   ScrofaZero
SA         −         − ∼ −
SK         −         − ∼ −
CA         −           −
CK         −           −
CQ         −           −
CJ         −           −
DA         30          20
DK         20          10
DQ         10          10
Table 4.
The experience parameters in Mr. Greed and the output of ScrofaZero. Negative means that the card is a burden, positive the opposite. ScrofaZero's values are estimated under typical situations.
7. Conclusion
In this work we introduce the trick-taking game Gongzhu as a new benchmark for incomplete information games. We train ScrofaZero, a human expert level AI capable of distilling information and updating beliefs from historical observations. The training starts from tabula rasa and does not require domain-specific human knowledge. We introduce stratified sampling and IEC to boost the performance of ScrofaZero.

Future research directions include designing better sampling techniques, incorporating sampling into the neural network, and applying our methods to other trick-taking games like contract bridge. Also, we believe the knowledge gained in training ScrofaZero can be transferred to other real-world applications where imperfect information plays a key role in decision making.
References
Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In The Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 2019.

Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4, 2012.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 2012.

Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., and Jaderberg, M. Real world games look like spinning tops. In NeurIPS, 2020.

Doyle, A. C. The Sign of Four.

Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-Carlo tree search as regularized policy optimization. In International Conference on Machine Learning, 2020.

Kuhn, H. W. Simplified two-person poker. Contributions to the Theory of Games, 1950.

Letcher, A. On the impossibility of global convergence in multi-loss optimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=NQbnPjPYaG6.

Letcher, A., Balduzzi, D., Racanière, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. Differentiable game mechanics. Journal of Machine Learning Research, 2019.

Li, J., Koyamada, S., Ye, Q., Liu, G., Wang, C., Yang, R., Zhao, L., Qin, T., Liu, T.-Y., and Hon, H.-W. Suphx: Mastering Mahjong with deep reinforcement learning. arXiv, 2020. URL https://arxiv.org/abs/2003.13590.

OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. Dota 2 with large scale deep reinforcement learning. 2019. URL https://arxiv.org/abs/1912.06680.

Rong, J., Qin, T., and An, B. Competitive bridge bidding with deep neural networks. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019.

Selten, R. Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory, 1975.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 2018.

Southey, F., Bowling, M. P., Larson, B., Piccione, C., Burch, N., Billings, D., and Rayner, C. Bayes' bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, 2005.

Tadelis, S. Game Theory: An Introduction. Princeton University Press, Princeton, New Jersey, 2013.

Tomašev, N., Paquet, U., Hassabis, D., and Kramnik, V. Assessing game balance with AlphaZero: Exploring alternative rule sets in chess, 2020.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019.

Whitehouse, D. Monte Carlo Tree Search for Games with Hidden Information and Uncertainty. PhD thesis, University of York, 2004.

Wikipedia. Computer bridge, 2021. URL https://en.wikipedia.org/wiki/Computer_bridge.

Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, 2007.
8. Experimental Details and Extended Data
Gongzhu belongs to the class trick-taking, which is a large set of games including bridge, Hearts, Gongzhu and Shengji. We dedicate this section to familiarizing readers with trick-taking games. Trick-taking games share the following common rules.

1. A standard 52-card deck is used in most cases.
2. Generally, there are four players paired in partnership, with partners sitting opposite each other around a table.
3. Cards are shuffled and dealt to the four players at the beginning.
4. As the name suggests, a trick-taking game consists of a number of tricks. In a trick, the four players play one card each, sequentially, by the following rules:
   • The player leading the first trick is chosen randomly or by turns. The first card of each trick can be any card in that player's hand.
   • The following players should follow the suit if possible. There are no limits on the ranking of the cards played.
   • At the end of each trick, the four cards played are ranked and the player who played the card of the highest rank becomes the winner.
   • The winner of the last trick leads the next trick.
   • The playing order is usually clockwise.
5. The cards are usually ranked by: A K Q J 10 9 8 7 6 5 4 3 2.
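As an illustration of the trick mechanics above, here is a minimal Python sketch that determines the winner of one trick under the simple rule that only cards of the led suit can win (no trumps, as in Gongzhu); the encoding and the function name are our own.

```python
RANK_ORDER = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']

def trick_winner(plays):
    """plays: list of (player, card) in playing order, card strings like 'H10' or 'SQ'.
    Returns the player who wins the trick: highest rank of the suit that was led."""
    led_suit = plays[0][1][0]                         # first character of a card is its suit
    def rank(card):
        return RANK_ORDER.index(card[1:])             # the rest of the string is the rank
    winner, _ = max(
        (p for p in plays if p[1][0] == led_suit),    # only cards following the led suit can win
        key=lambda p: rank(p[1]),
    )
    return winner
```

For example, trick_winner([(0, 'H10'), (1, 'HA'), (2, 'S2'), (3, 'H3')]) returns 1, since HA is the highest heart and hearts were led.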
We build a group of human experience based AIs using standard methods. The group includes

1. Mr. Random: a random player choosing cards from the legal choices uniformly at random.
2. Mr. If: a program with 33 if statements representing human experience. Mr. If can outperform Mr. Random by a lot with such a small number of ifs.
3. Mr. Greed: an AI with hundreds of if statements and partial backward induction. It contains many handcrafted hyper-parameters to complete the search. Mr. Greed can outperform Mr. If, though not in proportion to their numbers of if statements.

The performances of these AIs are shown in table 2. More details can be found in our repository.
We use fully connected layers with skip connections as our model's basic block. The input of our neural network is encoded into a 434-dimensional onehot vector (this is explained in detail in the next subsection). The output is a 53-dimensional vector, with the first 52 elements being the $p$ vector and the last element being $v$.

We also tried other network topologies, including shallower fully connected layers and ResNet. Their performance is shown in table 5.

Network Topology    Layers    Parameters    WPG
Fully Connection      24      11416299        ±
ResNet                18      11199093      − ±

Table 5.
Performance of different networks. These scores are evaluated by the WPG introduced in section 5.2. Sufficient rounds of games are played to make sure the variance is small enough.
Since ResNet does not show a significant improvement in performance, we stick to the fully connected neural network for most of our experiments.
The input of our neural network is a vector concatenating three groups of features.

• The hands of the 4 players (one 52-card representation per player, since Gongzhu uses a standard 52-card deck). The cards of the other 3 players are guessed by the methods discussed in Section 5.3.
• The cards played in this trick. We choose this format because at most 3 cards are played before the specified player can play. The per-card representation here is slightly longer than 52 due to the diffuse technique described in the next subsection.
• The cards associated with points that have already been played. In the game Gongzhu, there are 16 cards which carry scores.

We use the diffuse technique for representing the cards in this trick when preparing inputs. Normally we would set only the single element of the onehot vector corresponding to a card to 1. However, we want to amplify the input signal. Hence we set not only the element in the onehot vector corresponding to this card, but also the two adjacent elements, to 1. We also extend the length of the representation of each card so that the elements at the two endpoints can be diffused as well. Figure 5 shows how the diffuse technique works. Input in this form can be transformed to and from the standard input by a single fully connected layer. In experiments, the diffuse technique accelerates training.
We use the diffuse technique when representing the cards in this trick to accelerate training. The upper part shows the normal input, while the lower part shows the input after applying the diffuse technique.
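A minimal numpy sketch of the diffuse encoding is shown below; the vector length and index mapping are illustrative assumptions rather than the exact layout used by ScrofaZero.

```python
import numpy as np

def diffuse_encode(card_index, length=54):
    """Encode one card (index 0..51) as a 'diffused' onehot vector.

    Instead of setting only the element at card_index to 1, the two adjacent
    elements are also set to 1; the vector is padded (length > 52) so that
    cards at the two endpoints can be diffused as well.
    """
    v = np.zeros(length)
    center = card_index + 1                # shift by one to leave room at the left endpoint
    v[center - 1:center + 2] = 1.0         # the card and its two neighbors
    return v
```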
We use standard MCTS with the value network to evaluate positions in the tree. We use the UCB (Upper Confidence Bound) method (Bubeck & Cesa-Bianchi, 2012) to select a child node. More specifically, for each selection we choose node $i = \arg\max_j \, v_j + c \sqrt{\ln N / n_j}$, where $v_j$ is the value of node $j$, $n_j$ is how many times node $j$ has been visited, $N$ is the total number of visits, and $c$ is the exploration constant. Two hyperparameters are important in MCTS: the exploration constant $c$ and the search number $T_{\mathrm{MCTS}}$. We keep the exploration constant at $c = 30$. For the search number, we set $T_{\mathrm{MCTS}} = 2 \times \{\text{legal choice number}\}$ when training and $T_{\mathrm{MCTS}} = 10 + 2 \times \{\text{legal choice number}\}$ when evaluating. The search number is crucial in training and is discussed in the next subsection.

The network is trained under perfect information, where it can see the cards in the others' hands. In other words, we do not need to sample hidden information during training. This setting is preferred because

1. without sampling, training is more robust;
2. in this way, more diversified middlegames and endings can be reached, which helps the neural network improve faster by learning from different circumstances.

For example, we find that a neural network trained with perfect information masters the techniques in all hearts much faster than one trained with imperfect information.

After each MCTS search and play, the input and the rewards for the different cards are saved in a buffer for training. The search number in MCTS is crucial. When it is small, the neural network of course will not improve. However, we find that when the search number in MCTS is too large, the neural network will again not improve, or will even degrade! We find $2 \times \{\text{legal choice number}\}$ the most suitable search number for MCTS. Notice that, with $2 \times \{\text{legal choice number}\}$ searches, MCTS can only predict the future approximately two cards ahead. It is quite surprising that the neural network can still improve and finally acquire an "intuition" for the long term future.

We typically let four AIs play 64 games and then train for one batch. What is more, inspired by online learning, we also let the neural network review the data of the last two batches, so the data used for one update comes from 3 × 64 games. The target function (3) is then optimized by the Adam optimizer for 3 iterations.

There is one thing special in our handling of the loss function which deserves some explanation. Normally, it is natural to mask the illegal choices in the probability output of the network (i.e. mask them after the softmax layer). However, we mask the output before the softmax layer. In coding language, we use softmax(p × legal_mask) rather than legal_mask × softmax(p). We find this procedure much better than the other one. A possible explanation is that, if we apply the mask before the softmax, the information about illegal choices is still preserved in the loss, which can help the layers before the output find a more reasonable feature representation.

We provide an arena for different AIs to combat. We define standard protocols for AIs and the arena to communicate, and we set up a server welcoming AIs from all around the world. Every AI obeying our protocol can combat with our AIs and download detailed statistics. We also provide data and standard utility functions for others to use in training. More details can be found in our GitHub repository.
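The pre-softmax masking discussed above can be expressed in a few lines. The sketch below (plain PyTorch, with our own variable names) contrasts the two options; the variant used in the paper corresponds to masked_before.

```python
import torch
import torch.nn.functional as F

def masked_before(p_logits, legal_mask):
    """Mask the raw outputs before the softmax (the variant used in the paper).
    legal_mask is a 0/1 vector over the 52 actions."""
    return F.softmax(p_logits * legal_mask, dim=-1)

def masked_after(p_logits, legal_mask):
    """Mask after the softmax, then renormalize (the more common variant)."""
    probs = F.softmax(p_logits, dim=-1) * legal_mask
    return probs / probs.sum(dim=-1, keepdim=True)
```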
In section 5.2, we introduced the Winning Point over Mr. Greed (WPG). For this evaluation system to be well-defined, WPG should be transitive, or at least approximately transitive, i.e. a program with a higher WPG should play better. In this section, we introduce a statistic $\epsilon$ that measures the intransitivity of an evaluation function, then show that WPG is nearly transitive by numerical experiments.

We first define $\xi_{ij}$ to be the average winning score of playing strategy $\pi_i$ against strategy $\pi_j$. Let us start by considering two extreme cases. A game is transitive on $\Pi$ if

$$\xi_{ij} + \xi_{jk} = \xi_{ik} \quad \forall\, \pi_i, \pi_j, \pi_k \in \Pi, \qquad (16)$$

where $\Pi$ is a subspace of the strategy space. As a famous example of an intransitive game, the 2-player rock-paper-scissors game is not transitive for the strategy tuple (Always play rock, Always play paper, Always play scissors).

Between the totally transitive and totally intransitive games, we want to build a function $\epsilon_\Pi$ that describes the transitivity of a policy tuple $\Pi = (\pi_1, \pi_2, \dots, \pi_n)$. To better characterize the intransitivity, the function $\epsilon$ should have the following properties:

a) take values inside $[0, 1]$, and equal 0 when the evaluation system is totally transitive and 1 when it is totally intransitive;
b) be invariant under translation and inflation of scores;
c) be invariant under reindexing of the $\pi_i$'s in $\Pi$;
d) make use of the combat results of every triple of strategies $(\pi_i, \pi_j, \pi_k)$ in $\Pi$; there are $C_n^3 = n(n-1)(n-2)/6$ such triples;
e) not degenerate (i.e. approach 0 or 1) under infinite duplication of any $\pi_i \in \Pi$;
f) be stable under adding strategies of a similar level into $\Pi$.

We define $\epsilon$ to be $\epsilon_\Pi = \sum_i \dots$

Table 6. An example of an irrelevant card. Here D4 is the irrelevant card. $\gamma$ is the correction factor of the corresponding cards.

From figure 4, we know that irrelevant cards are most likely to appear in spades. Table 7 gives an example for the irrelevant card SJ. Apart from SJ, other small cards, S2 to S7, are also highly likely to be irrelevant cards in this situation.

Cards Guessed    γ Without SJ    γ With SJ
S5                  0.9424        0.9464
S10                 1.0000        1.0000
SJ                    -           0.9598
S8                  0.9560        0.9585
S10                 1.0000        1.0000
SQ                  0.4818        0.4950
SJ                    -           0.9528
S7                  1.0000        1.0000
SK                  0.8302        0.8331
SJ                    -           0.9677

Table 7. An example of an irrelevant card. Here SJ is the irrelevant card. $\gamma$ is the correction factor of the corresponding cards.

As discussed in section 5.4, as an attention mechanism and to save computational resources, some "important" history slices are selected in the IEC algorithm based on the statistics in section 6. From figures 3 and 4, we can see that cards smaller than 8 always have small correction factors within their suit and low value variance. This means that history slices where the card played is smaller than 8 are highly likely to be unimportant. So we only select history slices where the card played is greater than 7 in the IEC algorithm. Another selection rule is that, when a player is following the suit, history slices of other suits are also unimportant.

In section 5.4, we introduced the IEC algorithm. However, our network's input requires the others' hands. We should not give the hands of a particular scenario directly to the neural network, because the player cannot know the exact hands of the other players. As a workaround, we average the hands information and give it to the neural network, as shown in figure 6. In other words, we replace the "onehot" in the input representing the others' hands with a "triple-1/3-hot".

Figure 6.
The network input representing the others' hands is averaged in IEC. The replacement from "one" to "triple-1/3" is not standard. Table 8 shows that the performance does not deteriorate under this nonstandard input.

Input method                          WPG
Standard input ("onehot")              ±
Averaged input ("triple-1/3-hot")      ±

Table 8. Raw network performance under standard input and averaged input. Raw network means that we directly use the policy network in playing rather than MCTS.

Here we derive equation (15). Once irrelevant cards are defined, $c_{-i}$ is divided into three parts: cards already played, relevant cards, and approximately irrelevant cards. From now on, we refer to approximately irrelevant cards simply as irrelevant cards. The $J_j$ in equation (14) is the sum of the correction factors of all irrelevant cards held by player $j$, while $Y_t$ in equation (14) is the sum of the correction factors of the other cards of the player who played at stage $t$:

$$\sum_{c_k \in \text{irrelevant cards of player } j} \gamma(c_k) \triangleq J_j, \qquad \sum_{c_k \in \text{cards played} \,\cup\, \text{relevant cards}} \gamma(c_k) \triangleq Y_t \qquad (21)$$

Notice that, by the definition of irrelevant cards, $\sum_{j=1}^{3} J_j$ remains unchanged across the different scenarios at a given decision node; we denote it by $J$:

$$J_1 + J_2 + J_3 = \text{Const} \triangleq J. \qquad (22)$$

The distribution of $J_j$ is multinomial. In most situations, there are many (5 or more) irrelevant cards. By the central limit theorem, a multinomial distribution can be approximated by a multivariate normal distribution. Since $J_3 = J - J_1 - J_2$, we can derive the marginal distribution of $J_1$ and $J_2$. We adopt this approximation and replace the summation over all permutations of irrelevant cards in equation (14) by an integral:

$$p(h_u \mid c_{-i}) \approx \frac{3^N}{2\pi |\Sigma|^{1/2}} \iint \exp\Bigl(-\tfrac{1}{2}(x_1, x_2)\, \Sigma^{-1} (x_1, x_2)^{T}\Bigr) \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{Y_{t+1} + J/3 + x_{j(t+1)}}\, \mathrm{d}x_1\, \mathrm{d}x_2 = 3^N \prod_{t=0}^{u-1} \frac{\gamma(a_{t+1}, h_t, c_{j(t+1)})}{Y_{t+1} + J/3}\; O(\xi) \qquad (23)$$

where $N$ is the number of irrelevant cards, $J$ the sum of the correction factors of all irrelevant cards, $x_j = J_j - J/3$, and $\xi$ a bounded quantity. The last equality holds because $\Sigma$ satisfies

$$\frac{1}{2\pi |\Sigma|^{1/2}} \iint_{x_1 + x_2 < J,\; x_1, x_2 > -J} e^{-\frac{1}{2}(x_1, x_2)\, \Sigma^{-1} (x_1, x_2)^{T}}\, \mathrm{d}x_1\, \mathrm{d}x_2 = 1 - O(\xi).$$