Complexity and Algorithms for Exploiting Quantal Opponents in Large Two-Player Games
David Milec, Jakub Černý, Viliam Lisý, Bo An
Czech Technical University in Prague, Czech Republic; Nanyang Technological University, Singapore
Abstract
Solution concepts of traditional game theory assume entirely rational players; therefore, their ability to exploit subrational opponents is limited. One type of subrationality that describes human behavior well is the quantal response. While there exist algorithms for computing solutions against quantal opponents, they either do not scale or may provide strategies that are even worse than the entirely-rational Nash strategies. This paper aims to analyze and propose scalable algorithms for computing effective and robust strategies against a quantal opponent in normal-form and extensive-form games. Our contributions are: (1) we define two different solution concepts related to exploiting quantal opponents and analyze their properties; (2) we prove that computing these solutions is computationally hard; (3) therefore, we evaluate several heuristic approximations based on scalable counterfactual regret minimization (CFR); and (4) we identify a CFR variant that exploits the bounded opponents better than the previously used variants while being less exploitable by the worst-case perfectly-rational opponent.
Extensive-form games are a powerful model able to describe recreational games, such as poker, as well as real-world situations from physical or network security. Recent advances in solving these games, and particularly the Counterfactual Regret Minimization (CFR) framework (Zinkevich et al. 2008), allowed creating superhuman agents even in huge games, such as no-limit Texas hold'em with its enormous number of decision points (Moravčík et al. 2017; Brown and Sandholm 2018). The algorithms generally approximate a Nash equilibrium, which assumes that all players are perfectly rational, and is known to be inefficient in exploiting weaker opponents. An algorithm able to take an opponent's imperfection into account is expected to win by a much larger margin (Johanson and Bowling 2009; Bard et al. 2013).

The most common model of bounded rationality in humans is the quantal response (QR) model (McKelvey and Palfrey 1995, 1998). Multiple experiments identified it as a good predictor of human behavior in games (Yang, Ordonez, and Tambe 2012; Haile, Hortaçsu, and Kosenok), and it lies at the heart of the algorithms successfully deployed in the real world (Yang, Ordonez, and Tambe 2012; Fang et al. 2017). It suggests that players respond stochastically, picking better actions with higher probability. Therefore, we investigate how to scalably compute a good strategy against a quantal response opponent in two-player normal-form and extensive-form games.

If both players choose their actions based on the QR model, their behavior is described by a quantal response equilibrium (QRE). Finding QRE is a computationally tractable problem (McKelvey and Palfrey 1995; Turocy 2005), which can also be solved using the CFR framework (Farina, Kroer, and Sandholm 2019). However, when creating AI agents competing with humans, we want to assume that one of the players is perfectly rational, and only the opponent's rationality is bounded.
A tempting approach may be using the algorithms for computing QRE and increasing one player's rationality, or using generic algorithms for exploiting opponents (Davis, Burch, and Bowling 2014) even though the QR model does not satisfy their assumptions, as in (Basak et al. 2018). However, this approach generally leads to a solution concept we call Quantal Nash Equilibrium (QNE), which we show is very inefficient in exploiting QR opponents and may even perform worse than an arbitrary Nash equilibrium.

Since the very nature of the quantal response model assumes that the sub-rational agent responds to a strategy played by its opponent, a more natural setting for studying the optimal strategies against QR opponents are Stackelberg games, in which one player commits to a strategy that is then learned and responded to by the opponent. Optimal commitments against quantal response opponents, called Quantal Stackelberg Equilibria (QSE), have been studied in security games (Yang, Ordonez, and Tambe 2012), and the results were recently extended to normal-form games (Černý et al. 2020). Even in these one-shot games, polynomial algorithms are available only for very limited subclasses. In extensive-form games, we show that computing the QSE is NP-hard, even in zero-sum games. Therefore, it is very unlikely that the CFR framework could be adapted to closely approximate these strategies. Since we aim for high scalability, we focus on empirical evaluation of several heuristics, including using QNE as an approximation of QSE. We identify a method that is not only more exploitative than QNE, but also more robust when the opponent is rational.

Our contributions are: We analyze the relationship and properties of two solution concepts with quantal opponents that naturally arise from Nash equilibrium (QNE) and Stackelberg equilibrium (QSE). We prove that computing QNE is PPAD-hard even in NFGs, and computing QSE in EFGs is NP-hard.
Therefore, we investigate the performance of CFR-based heuristics against QR opponents. An extensive empirical evaluation on four different classes of games with large numbers of histories identifies a variant of CFR-f (Davis, Burch, and Bowling 2014) that computes strategies better than both QNE and NE.

Even though our main focus is on extensive-form games, we study the concepts in normal-form games, which can be seen as their conceptually simpler special case. After defining the models, we proceed to define quantal response and the metrics for evaluating a deployed strategy's quality.
Two-player Normal-form Games
A two-player normal-form game (NFG) is a tuple G = (N, A, u), where N = {△, ▽} is the set of players. We use i and −i for one player and her opponent. A = {A△, A▽} denotes the set of ordered sets of actions of both players. The utility function u_i : A△ × A▽ → ℝ assigns a value to each pair of actions. A game is called zero-sum if u△ = −u▽. A mixed strategy σ_i ∈ Σ_i is a probability distribution over A_i. For any strategy profile σ ∈ Σ = Σ△ × Σ▽ we use u_i(σ) = u_i(σ_i, σ_{−i}) as the expected outcome for player i, given the players follow strategy profile σ. A best response (BR) of player i to the opponent's strategy σ_{−i} is a strategy σ_i^BR ∈ BR_i(σ_{−i}) such that u_i(σ_i^BR, σ_{−i}) ≥ u_i(σ′_i, σ_{−i}) for all σ′_i ∈ Σ_i. An ε-best response is σ_i^{εBR} ∈ εBR_i(σ_{−i}), ε > 0, such that u_i(σ_i^{εBR}, σ_{−i}) + ε ≥ u_i(σ′_i, σ_{−i}) for all σ′_i ∈ Σ_i.

Given a normal-form game G = (N, A, u), a tuple of mixed strategies (σ_i^NE, σ_{−i}^NE), σ_i^NE ∈ Σ_i, σ_{−i}^NE ∈ Σ_{−i}, is a Nash Equilibrium if σ_i^NE is an optimal strategy of player i against strategy σ_{−i}^NE. Formally: σ_i^NE ∈ BR(σ_{−i}^NE) for all i ∈ {△, ▽}.

In many situations, the roles of the players are asymmetric. One player (the leader, △) has the power to commit to a strategy, and the other player (the follower, ▽) plays a best response. This model has many real-world applications (Tambe 2011); for example, the leader can correspond to a defense agency committing to a protocol to protect critical facilities. The common assumption in the literature is that the follower breaks ties in favor of the leader. Then the concept is called a Strong Stackelberg Equilibrium (SSE). A leader's strategy σ^SSE ∈ Σ△ is a Strong Stackelberg Equilibrium if it is an optimal strategy of the leader given that the follower best-responds. Formally: σ△^SSE = arg max_{σ′△ ∈ Σ△} u△(σ′△, BR▽(σ′△)).
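To make the definitions concrete, the following sketch checks the best-response and NE conditions in the classic matching pennies game. The game, function names, and tolerances are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

# Illustrative example (not from the paper): matching pennies as a zero-sum
# NFG. Rows are actions of player tri (△), columns of nabla (▽); entries are u_tri.
u_tri = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])

def br_value(sigma_row):
    """Best-response value of ▽ against the row mixture sigma_row.

    ▽'s utility is -u_tri (zero-sum), so her pure-action values are
    sigma_row @ (-u_tri); a best response picks the maximum."""
    return float((sigma_row @ (-u_tri)).max())

def is_nash(sigma_row, sigma_col, eps=1e-9):
    """Check the NE condition: each strategy best-responds to the other."""
    row_ok = float((u_tri @ sigma_col).max()) - sigma_row @ u_tri @ sigma_col <= eps
    col_ok = br_value(sigma_row) - sigma_row @ (-u_tri) @ sigma_col <= eps
    return bool(row_ok and col_ok)

uniform = np.array([0.5, 0.5])
print(is_nash(uniform, uniform))               # True: the uniform profile is the NE
print(is_nash(np.array([0.9, 0.1]), uniform))  # False: ▽ could exploit the skewed row
```

The same check applies verbatim to any two-player matrix game once `u_tri` is replaced; only zero-sum is assumed here.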
In zero-sum games, SSE is equivalent to NE (Conitzer and Sandholm 2006), and the expected utility is called the value of the game.

Two-player Extensive-form Games
A two-player extensive-form game (EFG) consists of a set of players N = {△, ▽, c}, where c denotes chance. A is a finite set of all actions available in the game. H ⊂ {a_1 a_2 ⋯ a_n | a_j ∈ A, n ∈ ℕ} is the set of histories in the game. We assume that H forms a non-empty finite prefix tree. We use g ⊏ h to denote that h extends g. The root of H is the empty sequence ∅. The set of leaves of H is denoted Z, and its elements z are called terminal histories. The histories not in Z are non-terminal histories. By A(h) = {a ∈ A | ha ∈ H} we denote the set of actions available at h. P : H ∖ Z → N is the player function which returns who acts in a given history. Denoting H_i = {h ∈ H ∖ Z | P(h) = i}, we partition the histories as H = H△ ∪ H▽ ∪ H_c ∪ Z. σ_c is the chance strategy defined on H_c; for each h ∈ H_c, σ_c(h) is a probability distribution over A(h). Utility functions assign each player a utility for each leaf node, u_i : Z → ℝ.

The game is of imperfect information if some actions or chance events are not fully observed by all players. The information structure is described by information sets for each player i, which form a partition I_i of H_i. For any information set I_i ∈ I_i, any two histories h, h′ ∈ I_i are indistinguishable to player i. Therefore A(h) = A(h′) whenever h, h′ ∈ I_i. For I_i ∈ I_i we denote by A(I_i) the set A(h) and by P(I_i) the player P(h) for any h ∈ I_i.

A strategy σ_i ∈ Σ_i of player i is a function that assigns a distribution over A(I_i) to each I_i ∈ I_i. A strategy profile σ = (σ△, σ▽) consists of strategies for both players. π^σ(h) is the probability of reaching h if all players play according to σ. We can decompose π^σ(h) = ∏_{i ∈ N} π_i^σ(h) into each player's contribution. Let π_{−i}^σ be the product of all players' contributions except that of player i (including chance).
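As a quick illustration of the reach-probability decomposition π^σ(h) = ∏_i π_i^σ(h), consider a toy tree in which chance deals H or T and player △ then acts from a single information set. The game, names, and numbers below are our own illustrative assumptions:

```python
# Toy illustration (not a game from the paper): chance deals H or T with
# probability 0.5 each; △ then picks l or r from her single information set.
CHANCE_PROBS = {"H": 0.5, "T": 0.5}
SIGMA_TRI = {"l": 0.7, "r": 0.3}  # △'s behavioral strategy in her only infoset

def reach(history):
    """Decompose pi^sigma(h) into chance's and △'s contributions."""
    pi_c, pi_tri = 1.0, 1.0
    for player, action in history:
        if player == "c":
            pi_c *= CHANCE_PROBS[action]
        else:  # player == "tri"
            pi_tri *= SIGMA_TRI[action]
    return pi_c, pi_tri

h = (("c", "H"), ("tri", "l"))
pi_c, pi_tri = reach(h)
print(pi_c * pi_tri)  # pi^sigma(h) = 0.35
print(pi_c)           # pi^sigma_{-tri}(h) = 0.5: the chance/opponent part only
```

Separating π_{−i}^σ from π_i^σ in this way is exactly what the counterfactual values defined next rely on.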
For I_i ∈ I_i, define π^σ(I_i) = Σ_{h ∈ I_i} π^σ(h) as the probability of reaching information set I_i given that all players play according to σ. π_i^σ(I_i) and π_{−i}^σ(I_i) are defined similarly. Finally, let π^σ(h, z) = π^σ(z)/π^σ(h) if h ⊏ z, and zero otherwise. π_i^σ(h, z) and π_{−i}^σ(h, z) are defined similarly. Using this notation, the expected payoff for player i is u_i(σ) = Σ_{z ∈ Z} u_i(z) π^σ(z). BR, NE and SSE are defined as in NFGs.

Define u_i(σ, h) as the expected utility given that the history h is reached and all players play according to σ. A counterfactual value v_i(σ, I) is the expected utility given that the information set I is reached and all players play according to strategy σ, except player i, who plays to reach I. Formally, v_i(σ, I) = Σ_{h ∈ I, z ∈ Z} π_{−i}^σ(h) π^σ(h, z) u_i(z). Similarly, the counterfactual value of playing action a in information set I is v_i(σ, I, a) = Σ_{h ∈ I, z ∈ Z, ha ⊏ z} π_{−i}^σ(ha) π^σ(ha, z) u_i(z).

We define S_i as the set of sequences of actions of player i only. For s_i ∈ S_i, inf(s_i) is the information set where the last action of s_i was executed, and for I ∈ I_i, seq_i(I) is the sequence of actions of player i leading to information set I.

Quantal Response Model of Bounded Rationality
Fully rational players always select the utility-maximizing strategy, i.e., the best response. Relaxing this assumption leads to a "statistical version" of the best response, which takes into account the inevitable error-proneness of humans and allows the players to make systematic errors (McFadden 1976; McKelvey and Palfrey 1995).
Definition 1.
Let G = (N, A, u) be an NFG. A function QR : Σ△ → Σ▽ is a quantal response function of player ▽ if the probability of playing action a increases monotonically as the expected utility of a increases. A quantal function QR is called canonical if, for some real-valued function q,

QR(σ, a_k) = q(u▽(σ, a_k)) / Σ_{a_i ∈ A▽} q(u▽(σ, a_i))   ∀σ ∈ Σ△, a_k ∈ A▽. (1)

Whenever q is a strictly positive increasing function, the corresponding QR is a valid quantal response function. Such functions q are called generators of canonical quantal functions. The most commonly used generator in the literature is the exponential (logit) function (McKelvey and Palfrey 1995) defined as q(x) = e^{λx}, where λ > 0. λ drives the model's rationality: the player behaves uniformly randomly for λ → 0 and becomes more rational as λ → ∞. We denote a logit quantal function as LQR.

In EFGs, we assume the bounded-rational player plays based on a quantal function in every information set separately, according to the counterfactual values.

Definition 2.
Let G be an EFG. A function QR : Σ△ → Σ▽ is a canonical counterfactual quantal response function of player ▽ with generator q if, for a strategy σ△, it produces a strategy σ▽ such that in every information set I ∈ I▽, for each action a_k ∈ A(I), it holds that

QR(σ△, I, a_k) = q(v▽(σ, I, a_k)) / Σ_{a_i ∈ A(I)} q(v▽(σ, I, a_i)), (2)

where QR(σ△, I, a_k) is the probability of playing action a_k in information set I and σ = (σ△, σ▽).

We denote the canonical counterfactual quantal response function with the logit generator as counterfactual logit quantal response (CLQR). CLQR differs from the traditional definition of logit agent quantal response (LAQR) (McKelvey and Palfrey 1998) in using counterfactual values instead of expected utilities. The main advantage of CLQR over LAQR is that CLQR defines a valid quantal strategy even in information sets unreachable due to a strategy of the opponent, which is necessary for applying the regret-minimization algorithms explained later.

Because the logit quantal function is the most well-studied function in the literature, with several deployed applications (Pita et al. 2008; Delle Fave et al. 2014; Fang et al. 2017), we focus most of our analysis and experimental results on (C)LQR. Without loss of generality, we assume the quantal player is always player ▽.

Metrics for Evaluating Quality of Strategy
In a two-player zero-sum game, the exploitability of a given strategy is defined as the expected utility that a fully rational opponent can achieve above the value of the game. Formally, the exploitability E(σ_i) of strategy σ_i ∈ Σ_i is E(σ_i) = u_{−i}(σ_i, σ_{−i}) − u_{−i}(σ^NE), where σ_{−i} ∈ BR_{−i}(σ_i).

We also intend to measure how much we are able to exploit an opponent's bounded-rational behavior. For this purpose, we define the gain of a strategy against a quantal response as the expected utility we receive above the value of the game. Formally, the gain G(σ_i) of strategy σ_i is defined as G(σ_i) = u_i(σ_i, QR(σ_i)) − u_i(σ^NE).

General-sum games do not have the property that all NEs have the same expected utility. Therefore, we simply measure expected utility against LQR and BR opponents there.
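Both metrics can be computed directly in a small matrix game. The sketch below implements LQR via Eq. (1) with generator e^{λx} and evaluates gain and exploitability in matching pennies, whose value is 0; the game and all parameter choices are our illustrative assumptions:

```python
import numpy as np

u_tri = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])  # matching pennies; the value of the game is 0

def logit_qr(sigma_row, lam=1.0):
    """LQR of ▽: softmax of her action values, generator q(x) = e^{lam*x}."""
    values = sigma_row @ (-u_tri)        # u_nabla = -u_tri in a zero-sum game
    weights = np.exp(lam * values)
    return weights / weights.sum()

def gain(sigma_row, lam=1.0):
    """G(sigma) = u_tri(sigma, QR(sigma)) - value of the game (0 here)."""
    return float(sigma_row @ u_tri @ logit_qr(sigma_row, lam))

def exploitability(sigma_row):
    """E(sigma) = u_nabla(sigma, BR(sigma)) - u_nabla(NE), with u_nabla(NE) = 0."""
    return float((sigma_row @ (-u_tri)).max())

print(gain(np.array([0.5, 0.5])))            # 0.0: the NE strategy has zero gain
print(gain(np.array([0.9, 0.1])) < 0)        # True: naive skewing loses to LQR
print(exploitability(np.array([0.9, 0.1])))  # ~0.8: and it is exploitable by a BR
```

In this symmetric game no strategy achieves a positive gain; positive gains require structure a quantal opponent mishandles, such as dominated follower actions.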
This section formally defines two one-sided bounded-rational equilibria, where one of the players is rational and the other subrational: a saddle-point-type equilibrium called Quantal Nash Equilibrium (QNE) and a leader-follower-type equilibrium called Quantal Stackelberg Equilibrium (QSE). We show that, contrary to their fully-rational counterparts, QNE differs from QSE even in zero-sum games. Moreover, we show that computing QSE in extensive-form games is an NP-hard problem.
Quantal Equilibria in Normal-form Games
We first consider a variant of NE in which one of the players plays a quantal response instead of the best response.
Definition 3.
Given a normal-form game G = (N, A, u) and a quantal response function QR, a strategy profile (σ△^QNE, QR(σ△^QNE)) ∈ Σ describes a Quantal Nash Equilibrium (QNE) if and only if σ△^QNE is a best response of player △ against the quantal-responding player ▽. Formally:

σ△^QNE ∈ BR(QR(σ△^QNE)). (3)

QNE can be seen as a concept between NE and Quantal Response Equilibrium (QRE) (McKelvey and Palfrey 1995). While in NE both players are fully rational, and in QRE both players are assumed to behave bounded-rationally, in QNE one player is rational and the other is bounded-rational.

Theorem 1.
Computing a QNE strategy profile in two-player NFGs is a PPAD-hard problem.

Proof (Sketch).
We use a reduction from the problem of computing an ε-NE. We derive an upper bound on the maximum distance between a best response and a logit quantal response, which goes to zero as λ approaches infinity. For a given ε, we find λ such that QNE is an ε-NE. The full proof is provided in the appendix.

QNE usually outperforms NE against LQR in practice, as we show in the experiments. However, this cannot be guaranteed, as stated in Proposition 2.

Proposition 2.
For any LQR function, there exists a zero-sum normal-form game G = (N, A, u) with a unique NE (σ△^NE, σ▽^NE) and QNE (σ△^QNE, QR(σ△^QNE)) such that u△(σ△^NE, QR(σ△^NE)) > u△(σ△^QNE, QR(σ△^QNE)).

The second solution concept is a variant of SSE for situations in which the follower is bounded-rational.
Definition 4.
Given a normal-form game G = (N, A, u) and a quantal response function QR, a mixed strategy σ△^QSE ∈ Σ△ describes a Quantal Stackelberg Equilibrium (QSE) if and only if

σ△^QSE = arg max_{σ△ ∈ Σ△} u△(σ△, QR(σ△)). (4)

Full proofs of all propositions are in the appendix.

[Figure: only the axis labels are recoverable from the garbled plot: "Probability of action X", "Expected utility", "LQR objective", "value".]
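Since the objective in Eq. (4) is non-concave in general, even a sketch has to resort to search. Below we grid-search the one-parameter leader strategy space of a small zero-sum game of our own construction (matching pennies plus a strictly dominated column for ▽). A rational follower never plays the dominated action, so the value of the game is 0, yet LQR plays it with positive probability and the QSE commitment achieves a strictly positive gain:

```python
import numpy as np

# Illustrative game (our assumption, not from the paper): matching pennies
# extended with a third column that always pays ▽ a utility of -2.
u_tri = np.array([[1.0, -1.0, 2.0],
                  [-1.0, 1.0, 2.0]])

def logit_qr(sigma, lam=1.0):
    """LQR of ▽ with generator e^{lam*x} applied to her action values."""
    w = np.exp(lam * (sigma @ (-u_tri)))
    return w / w.sum()

def leader_utility(p, lam=1.0):
    """u_tri(sigma, QR(sigma)) for the leader strategy sigma = (p, 1-p)."""
    sigma = np.array([p, 1.0 - p])
    return float(sigma @ u_tri @ logit_qr(sigma, lam))

# The leader's strategy space here is one-dimensional, so a grid search
# suffices for this sketch; solving Eq. (4) exactly in EFGs is NP-hard.
grid = np.linspace(0.0, 1.0, 1001)
p_qse = max(grid, key=leader_utility)
print(round(float(p_qse), 3))     # 0.5: the QSE commitment
print(leader_utility(p_qse) > 0)  # True: positive gain above the game value of 0
```

In this particular game the QSE commitment coincides with the NE strategy yet still gains above the game value, because the quantal follower keeps playing the dominated action; in general the two strategies differ.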