Complexity and Algorithms for Exploiting Quantal Opponents in Large Two-Player Games
David Milec, Jakub Černý, Viliam Lisý, Bo An
Czech Technical University in Prague, Czech Republic; Nanyang Technological University, Singapore
Abstract
Solution concepts of traditional game theory assume entirely rational players; therefore, their ability to exploit subrational opponents is limited. One type of subrationality that describes human behavior well is the quantal response. While there exist algorithms for computing solutions against quantal opponents, they either do not scale or may provide strategies that are even worse than the entirely-rational Nash strategies. This paper aims to analyze and propose scalable algorithms for computing effective and robust strategies against a quantal opponent in normal-form and extensive-form games. Our contributions are: (1) we define two different solution concepts related to exploiting quantal opponents and analyze their properties; (2) we prove that computing these solutions is computationally hard; (3) therefore, we evaluate several heuristic approximations based on scalable counterfactual regret minimization (CFR); and (4) we identify a CFR variant that exploits the bounded opponents better than the previously used variants while being less exploitable by the worst-case perfectly-rational opponent.
Extensive-form games are a powerful model able to describe recreational games, such as poker, as well as real-world situations from physical or network security. Recent advances in solving these games, and particularly the Counterfactual Regret Minimization (CFR) framework (Zinkevich et al. 2008), allowed creating superhuman agents even in huge games, such as no-limit Texas hold'em with its enormous number of decision points (Moravčík et al. 2017; Brown and Sandholm 2018). The algorithms generally approximate a Nash equilibrium, which assumes that all players are perfectly rational, and is known to be inefficient in exploiting weaker opponents. An algorithm able to take an opponent's imperfection into account is expected to win by a much larger margin (Johanson and Bowling 2009; Bard et al. 2013).

The most common model of bounded rationality in humans is the quantal response (QR) model (McKelvey and Palfrey 1995, 1998). Multiple experiments identified it as a good predictor of human behavior in games (Yang, Ordonez, and Tambe 2012; Haile, Hortaçsu, and Kosenok), and it lies at the heart of the algorithms successfully deployed in the real world (Yang, Ordonez, and Tambe 2012; Fang et al. 2017). It suggests that players respond stochastically, picking better actions with higher probability. Therefore, we investigate how to scalably compute a good strategy against a quantal response opponent in two-player normal-form and extensive-form games.

If both players choose their actions based on the QR model, their behavior is described by a quantal response equilibrium (QRE). Finding QRE is a computationally tractable problem (McKelvey and Palfrey 1995; Turocy 2005), which can also be solved using the CFR framework (Farina, Kroer, and Sandholm 2019). However, when creating AI agents competing with humans, we want to assume that one of the players is perfectly rational, and only the opponent's rationality is bounded.
A tempting approach may be using the algorithms for computing QRE and increasing one player's rationality, or using generic algorithms for exploiting opponents (Davis, Burch, and Bowling 2014) even though the QR model does not satisfy their assumptions, as in (Basak et al. 2018). However, this approach generally leads to a solution concept we call Quantal Nash Equilibrium (QNE), which we show is very inefficient in exploiting QR opponents and may even perform worse than an arbitrary Nash equilibrium.

Since the very nature of the quantal response model assumes that the sub-rational agent responds to a strategy played by its opponent, a more natural setting for studying the optimal strategies against QR opponents are Stackelberg games, in which one player commits to a strategy that is then learned and responded to by the opponent. Optimal commitments against quantal response opponents, called Quantal Stackelberg Equilibria (QSE), have been studied in security games (Yang, Ordonez, and Tambe 2012), and the results were recently extended to normal-form games (Černý et al. 2020). Even in these one-shot games, polynomial algorithms are available only for very limited subclasses. In extensive-form games, we show that computing the QSE is NP-hard, even in zero-sum games. Therefore, it is very unlikely that the CFR framework could be adapted to closely approximate these strategies. Since we aim for high scalability, we focus on empirical evaluation of several heuristics, including using QNE as an approximation of QSE. We identify a method that is not only more exploitative than QNE, but also more robust when the opponent is rational.

Our contributions are: We analyze the relationship and properties of two solution concepts with quantal opponents that naturally arise from Nash equilibrium (QNE) and Stackelberg equilibrium (QSE). We prove that computing QNE is PPAD-hard even in NFGs, and computing QSE in EFGs is NP-hard.
Therefore, we investigate the performance of CFR-based heuristics against QR opponents. An extensive empirical evaluation on four different classes of games with large numbers of histories identifies a variant of CFR-f (Davis, Burch, and Bowling 2014) that computes strategies better than both QNE and NE.

Even though our main focus is on extensive-form games, we study the concepts in normal-form games, which can be seen as their conceptually simpler special case. After defining the models, we proceed to define quantal response and the metrics for evaluating a deployed strategy's quality.
Two-player Normal-form Games
A two-player normal-form game (NFG) is a tuple G = (N, A, u), where N = {△, ▽} is the set of players. We use i and −i for one player and her opponent. A = {A△, A▽} denotes the set of ordered sets of actions of both players. The utility function u_i : A△ × A▽ → ℝ assigns a value to each pair of actions. A game is called zero-sum if u△ = −u▽. A mixed strategy σ_i ∈ Σ_i is a probability distribution over A_i. For any strategy profile σ ∈ Σ = Σ△ × Σ▽ we use u_i(σ) = u_i(σ_i, σ_{−i}) as the expected outcome for player i, given the players follow strategy profile σ. A best response (BR) of player i to the opponent's strategy σ_{−i} is a strategy σ_i^BR ∈ BR_i(σ_{−i}) such that u_i(σ_i^BR, σ_{−i}) ≥ u_i(σ′_i, σ_{−i}) for all σ′_i ∈ Σ_i. An ε-best response is σ_i^{εBR} ∈ εBR_i(σ_{−i}), ε > 0, such that u_i(σ_i^{εBR}, σ_{−i}) + ε ≥ u_i(σ′_i, σ_{−i}) for all σ′_i ∈ Σ_i.

Given a normal-form game G = (N, A, u), a tuple of mixed strategies (σ_i^NE, σ_{−i}^NE), σ_i^NE ∈ Σ_i, σ_{−i}^NE ∈ Σ_{−i}, is a Nash Equilibrium if σ_i^NE is an optimal strategy of player i against strategy σ_{−i}^NE. Formally: σ_i^NE ∈ BR(σ_{−i}^NE) for all i ∈ {△, ▽}.

In many situations, the roles of the players are asymmetric. One player (the leader, △) has the power to commit to a strategy, and the other player (the follower, ▽) plays a best response. This model has many real-world applications (Tambe 2011); for example, the leader can correspond to a defense agency committing to a protocol to protect critical facilities. The common assumption in the literature is that the follower breaks ties in favor of the leader. Then the concept is called a Strong Stackelberg Equilibrium (SSE). A leader's strategy σ^SSE ∈ Σ△ is a Strong Stackelberg Equilibrium if it is an optimal strategy of the leader given that the follower best-responds. Formally: σ△^SSE = arg max_{σ′△ ∈ Σ△} u△(σ′△, BR▽(σ′△)).
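To make the definitions concrete, the following sketch checks the best-response and NE conditions in the classic matching pennies game. The game, function names, and tolerances are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

# Illustrative example (not from the paper): matching pennies as a zero-sum
# NFG. Rows are actions of player tri (△), columns of nabla (▽); entries are u_tri.
u_tri = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])

def br_value(sigma_row):
    """Best-response value of ▽ against the row mixture sigma_row.

    ▽'s utility is -u_tri (zero-sum), so her pure-action values are
    sigma_row @ (-u_tri); a best response picks the maximum."""
    return float((sigma_row @ (-u_tri)).max())

def is_nash(sigma_row, sigma_col, eps=1e-9):
    """Check the NE condition: each strategy best-responds to the other."""
    row_ok = float((u_tri @ sigma_col).max()) - sigma_row @ u_tri @ sigma_col <= eps
    col_ok = br_value(sigma_row) - sigma_row @ (-u_tri) @ sigma_col <= eps
    return bool(row_ok and col_ok)

uniform = np.array([0.5, 0.5])
print(is_nash(uniform, uniform))               # True: the uniform profile is the NE
print(is_nash(np.array([0.9, 0.1]), uniform))  # False: ▽ could exploit the skewed row
```

The same check applies verbatim to any two-player matrix game once `u_tri` is replaced; only zero-sum is assumed here.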
In zero-sum games, SSE is equivalent to NE (Conitzer and Sandholm 2006), and the expected utility is called the value of the game.

Two-player Extensive-form Games
A two-player extensive-form game (EFG) consists of a set of players N = {△, ▽, c}, where c denotes chance. A is a finite set of all actions available in the game. H ⊂ {a_1 a_2 ⋯ a_n | a_j ∈ A, n ∈ ℕ} is the set of histories in the game. We assume that H forms a non-empty finite prefix tree. We use g ⊏ h to denote that h extends g. The root of H is the empty sequence ∅. The set of leaves of H is denoted Z, and its elements z are called terminal histories. The histories not in Z are non-terminal histories. By A(h) = {a ∈ A | ha ∈ H} we denote the set of actions available at h. P : H ∖ Z → N is the player function which returns who acts in a given history. Denoting H_i = {h ∈ H ∖ Z | P(h) = i}, we partition the histories as H = H△ ∪ H▽ ∪ H_c ∪ Z. σ_c is the chance strategy defined on H_c; for each h ∈ H_c, σ_c(h) is a probability distribution over A(h). Utility functions assign each player a utility for each leaf node, u_i : Z → ℝ.

The game is of imperfect information if some actions or chance events are not fully observed by all players. The information structure is described by information sets for each player i, which form a partition I_i of H_i. For any information set I_i ∈ I_i, any two histories h, h′ ∈ I_i are indistinguishable to player i. Therefore A(h) = A(h′) whenever h, h′ ∈ I_i. For I_i ∈ I_i we denote by A(I_i) the set A(h) and by P(I_i) the player P(h) for any h ∈ I_i.

A strategy σ_i ∈ Σ_i of player i is a function that assigns a distribution over A(I_i) to each I_i ∈ I_i. A strategy profile σ = (σ△, σ▽) consists of strategies for both players. π^σ(h) is the probability of reaching h if all players play according to σ. We can decompose π^σ(h) = ∏_{i ∈ N} π_i^σ(h) into each player's contribution. Let π_{−i}^σ be the product of all players' contributions except that of player i (including chance).
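As a quick illustration of the reach-probability decomposition π^σ(h) = ∏_i π_i^σ(h), consider a toy tree in which chance deals H or T and player △ then acts from a single information set. The game, names, and numbers below are our own illustrative assumptions:

```python
# Toy illustration (not a game from the paper): chance deals H or T with
# probability 0.5 each; △ then picks l or r from her single information set.
CHANCE_PROBS = {"H": 0.5, "T": 0.5}
SIGMA_TRI = {"l": 0.7, "r": 0.3}  # △'s behavioral strategy in her only infoset

def reach(history):
    """Decompose pi^sigma(h) into chance's and △'s contributions."""
    pi_c, pi_tri = 1.0, 1.0
    for player, action in history:
        if player == "c":
            pi_c *= CHANCE_PROBS[action]
        else:  # player == "tri"
            pi_tri *= SIGMA_TRI[action]
    return pi_c, pi_tri

h = (("c", "H"), ("tri", "l"))
pi_c, pi_tri = reach(h)
print(pi_c * pi_tri)  # pi^sigma(h) = 0.35
print(pi_c)           # pi^sigma_{-tri}(h) = 0.5: the chance/opponent part only
```

Separating π_{−i}^σ from π_i^σ in this way is exactly what the counterfactual values defined next rely on.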
For I_i ∈ I_i, define π^σ(I_i) = Σ_{h ∈ I_i} π^σ(h) as the probability of reaching information set I_i given that all players play according to σ. π_i^σ(I_i) and π_{−i}^σ(I_i) are defined similarly. Finally, let π^σ(h, z) = π^σ(z)/π^σ(h) if h ⊏ z, and zero otherwise. π_i^σ(h, z) and π_{−i}^σ(h, z) are defined similarly. Using this notation, the expected payoff for player i is u_i(σ) = Σ_{z ∈ Z} u_i(z) π^σ(z). BR, NE and SSE are defined as in NFGs.

Define u_i(σ, h) as the expected utility given that the history h is reached and all players play according to σ. A counterfactual value v_i(σ, I) is the expected utility given that the information set I is reached and all players play according to strategy σ, except player i, who plays to reach I. Formally, v_i(σ, I) = Σ_{h ∈ I, z ∈ Z} π_{−i}^σ(h) π^σ(h, z) u_i(z). Similarly, the counterfactual value of playing action a in information set I is v_i(σ, I, a) = Σ_{h ∈ I, z ∈ Z, ha ⊏ z} π_{−i}^σ(ha) π^σ(ha, z) u_i(z).

We define S_i as the set of sequences of actions of player i only. For s_i ∈ S_i, inf(s_i) is the information set where the last action of s_i was executed, and for I ∈ I_i, seq_i(I) is the sequence of actions of player i leading to information set I.

Quantal Response Model of Bounded Rationality
Fully rational players always select the utility-maximizing strategy, i.e., the best response. Relaxing this assumption leads to a "statistical version" of the best response, which takes into account the inevitable error-proneness of humans and allows the players to make systematic errors (McFadden 1976; McKelvey and Palfrey 1995).
Definition 1.
Let G = (N, A, u) be an NFG. A function QR : Σ△ → Σ▽ is a quantal response function of player ▽ if the probability of playing action a increases monotonically as the expected utility of a increases. A quantal function QR is called canonical if, for some real-valued function q,

QR(σ, a_k) = q(u▽(σ, a_k)) / Σ_{a_i ∈ A▽} q(u▽(σ, a_i))   ∀σ ∈ Σ△, a_k ∈ A▽. (1)

Whenever q is a strictly positive increasing function, the corresponding QR is a valid quantal response function. Such functions q are called generators of canonical quantal functions. The most commonly used generator in the literature is the exponential (logit) function (McKelvey and Palfrey 1995) defined as q(x) = e^{λx}, where λ > 0. λ drives the model's rationality: the player behaves uniformly randomly for λ → 0 and becomes more rational as λ → ∞. We denote a logit quantal function as LQR.

In EFGs, we assume the bounded-rational player plays based on a quantal function in every information set separately, according to the counterfactual values.

Definition 2.
Let G be an EFG. A function QR : Σ△ → Σ▽ is a canonical counterfactual quantal response function of player ▽ with generator q if, for a strategy σ△, it produces a strategy σ▽ such that in every information set I ∈ I▽, for each action a_k ∈ A(I), it holds that

QR(σ△, I, a_k) = q(v▽(σ, I, a_k)) / Σ_{a_i ∈ A(I)} q(v▽(σ, I, a_i)), (2)

where QR(σ△, I, a_k) is the probability of playing action a_k in information set I and σ = (σ△, σ▽).

We denote the canonical counterfactual quantal response function with the logit generator as counterfactual logit quantal response (CLQR). CLQR differs from the traditional definition of logit agent quantal response (LAQR) (McKelvey and Palfrey 1998) in using counterfactual values instead of expected utilities. The main advantage of CLQR over LAQR is that CLQR defines a valid quantal strategy even in information sets unreachable due to a strategy of the opponent, which is necessary for applying the regret-minimization algorithms explained later.

Because the logit quantal function is the most well-studied function in the literature, with several deployed applications (Pita et al. 2008; Delle Fave et al. 2014; Fang et al. 2017), we focus most of our analysis and experimental results on (C)LQR. Without loss of generality, we assume the quantal player is always player ▽.

Metrics for Evaluating Quality of Strategy
In a two-player zero-sum game, the exploitability of a given strategy is defined as the expected utility that a fully rational opponent can achieve above the value of the game. Formally, the exploitability E(σ_i) of strategy σ_i ∈ Σ_i is E(σ_i) = u_{−i}(σ_i, σ_{−i}) − u_{−i}(σ^NE), where σ_{−i} ∈ BR_{−i}(σ_i).

We also intend to measure how much we are able to exploit an opponent's bounded-rational behavior. For this purpose, we define the gain of a strategy against a quantal response as the expected utility we receive above the value of the game. Formally, the gain G(σ_i) of strategy σ_i is defined as G(σ_i) = u_i(σ_i, QR(σ_i)) − u_i(σ^NE).

General-sum games do not have the property that all NEs have the same expected utility. Therefore, we simply measure expected utility against LQR and BR opponents there.
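Both metrics can be computed directly in a small matrix game. The sketch below implements LQR via Eq. (1) with generator e^{λx} and evaluates gain and exploitability in matching pennies, whose value is 0; the game and all parameter choices are our illustrative assumptions:

```python
import numpy as np

u_tri = np.array([[1.0, -1.0],
                  [-1.0, 1.0]])  # matching pennies; the value of the game is 0

def logit_qr(sigma_row, lam=1.0):
    """LQR of ▽: softmax of her action values, generator q(x) = e^{lam*x}."""
    values = sigma_row @ (-u_tri)        # u_nabla = -u_tri in a zero-sum game
    weights = np.exp(lam * values)
    return weights / weights.sum()

def gain(sigma_row, lam=1.0):
    """G(sigma) = u_tri(sigma, QR(sigma)) - value of the game (0 here)."""
    return float(sigma_row @ u_tri @ logit_qr(sigma_row, lam))

def exploitability(sigma_row):
    """E(sigma) = u_nabla(sigma, BR(sigma)) - u_nabla(NE), with u_nabla(NE) = 0."""
    return float((sigma_row @ (-u_tri)).max())

print(gain(np.array([0.5, 0.5])))            # 0.0: the NE strategy has zero gain
print(gain(np.array([0.9, 0.1])) < 0)        # True: naive skewing loses to LQR
print(exploitability(np.array([0.9, 0.1])))  # ~0.8: and it is exploitable by a BR
```

In this symmetric game no strategy achieves a positive gain; positive gains require structure a quantal opponent mishandles, such as dominated follower actions.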
This section formally defines two one-sided bounded-rational equilibria, where one of the players is rational and the other subrational: a saddle-point-type equilibrium called Quantal Nash Equilibrium (QNE) and a leader-follower-type equilibrium called Quantal Stackelberg Equilibrium (QSE). We show that, contrary to their fully-rational counterparts, QNE differs from QSE even in zero-sum games. Moreover, we show that computing QSE in extensive-form games is an NP-hard problem.
Quantal Equilibria in Normal-form Games
We first consider a variant of NE in which one of the players plays a quantal response instead of the best response.
Definition 3.
Given a normal-form game G = (N, A, u) and a quantal response function QR, a strategy profile (σ△^QNE, QR(σ△^QNE)) ∈ Σ describes a Quantal Nash Equilibrium (QNE) if and only if σ△^QNE is a best response of player △ against the quantal-responding player ▽. Formally:

σ△^QNE ∈ BR(QR(σ△^QNE)). (3)

QNE can be seen as a concept between NE and Quantal Response Equilibrium (QRE) (McKelvey and Palfrey 1995). While in NE both players are fully rational, and in QRE both players are assumed to behave bounded-rationally, in QNE one player is rational and the other is bounded-rational.

Theorem 1.
Computing a QNE strategy profile in two-player NFGs is a PPAD-hard problem.

Proof (Sketch).
We use a reduction from the problem of computing an ε-NE. We derive an upper bound on the maximum distance between a best response and a logit quantal response, which goes to zero as λ approaches infinity. For a given ε, we find λ such that QNE is an ε-NE. The full proof is provided in the appendix.

QNE usually outperforms NE against LQR in practice, as we show in the experiments. However, this cannot be guaranteed, as stated in Proposition 2.

Proposition 2.
For any LQR function, there exists a zero-sum normal-form game G = (N, A, u) with a unique NE (σ△^NE, σ▽^NE) and QNE (σ△^QNE, QR(σ△^QNE)) such that u△(σ△^NE, QR(σ△^NE)) > u△(σ△^QNE, QR(σ△^QNE)).

The second solution concept is a variant of SSE for situations in which the follower is bounded-rational.
Definition 4.
Given a normal-form game G = (N, A, u) and a quantal response function QR, a mixed strategy σ△^QSE ∈ Σ△ describes a Quantal Stackelberg Equilibrium (QSE) if and only if

σ△^QSE = arg max_{σ△ ∈ Σ△} u△(σ△, QR(σ△)). (4)

Full proofs of all propositions are in the appendix.

[Figure: only the axis labels are recoverable from the garbled plot: "Probability of action X", "Expected utility", "LQR objective", "value".]
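Since the objective in Eq. (4) is non-concave in general, even a sketch has to resort to search. Below we grid-search the one-parameter leader strategy space of a small zero-sum game of our own construction (matching pennies plus a strictly dominated column for ▽). A rational follower never plays the dominated action, so the value of the game is 0, yet LQR plays it with positive probability and the QSE commitment achieves a strictly positive gain:

```python
import numpy as np

# Illustrative game (our assumption, not from the paper): matching pennies
# extended with a third column that always pays ▽ a utility of -2.
u_tri = np.array([[1.0, -1.0, 2.0],
                  [-1.0, 1.0, 2.0]])

def logit_qr(sigma, lam=1.0):
    """LQR of ▽ with generator e^{lam*x} applied to her action values."""
    w = np.exp(lam * (sigma @ (-u_tri)))
    return w / w.sum()

def leader_utility(p, lam=1.0):
    """u_tri(sigma, QR(sigma)) for the leader strategy sigma = (p, 1-p)."""
    sigma = np.array([p, 1.0 - p])
    return float(sigma @ u_tri @ logit_qr(sigma, lam))

# The leader's strategy space here is one-dimensional, so a grid search
# suffices for this sketch; solving Eq. (4) exactly in EFGs is NP-hard.
grid = np.linspace(0.0, 1.0, 1001)
p_qse = max(grid, key=leader_utility)
print(round(float(p_qse), 3))     # 0.5: the QSE commitment
print(leader_utility(p_qse) > 0)  # True: positive gain above the game value of 0
```

In this particular game the QSE commitment coincides with the NE strategy yet still gains above the game value, because the quantal follower keeps playing the dominated action; in general the two strategies differ.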