[PDF] Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games

Abstract

We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithm is based on running an Optimistic Gradient Descent Ascent algorithm on each state to learn the policies, with a critic that slowly learns the value of each state. To the best of our knowledge, this is the first algorithm in this setting that is simultaneously rational (converging to the opponent's best response when it uses a stationary policy), convergent (converging to the set of Nash equilibria under self-play), agnostic (no need to know the actions played by the opponent), symmetric (players taking symmetric roles in the algorithm), and enjoying a finite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.

Full PDF

aa r X i v : . [ c s . L G ] F e b Last-iterate Convergence of Decentralized Optimistic GradientDescent/Ascent in Inﬁnite-horizon Competitive Markov Games

Chen-Yu Wei

CHENYU . WEI @ USC . EDU

Chung-Wei Lee * LEECHUNG @ USC . EDU

Mengxiao Zhang * MENGXIAO . ZHANG @ USC . EDU

Haipeng Luo

HAIPENGL @ USC . EDU

University of Southern California

Abstract

We study inﬁnite-horizon discounted two-player zero-sum Markov games, and develop a decentral-ized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithmis based on running an Optimistic Gradient Descent Ascent algorithm on each state to learn thepolicies, with a critic that slowly learns the value of each state. To the best of our knowledge, thisis the ﬁrst algorithm in this setting that is simultaneously rational (converging to the opponent’sbest response when it uses a stationary policy), convergent (converging to the set of Nash equilibriaunder self-play), agnostic (no need to know the actions played by the opponent), symmetric (play-ers taking symmetric roles in the algorithm), and enjoying a ﬁnite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.

1. Introduction

Multi-agent reinforcement learning studies how multiple agents should interact with each other andthe environment, and has wide applications in, for example, playing board games (Silver et al.,2017) and real-time strategy games (Vinyals et al., 2019). To model these problems, the frameworkof Markov games (also called stochastic games) (Shapley, 1953) is often used, which can be seenas a generalization of Markov Decision Processes (MDPs) from a single agent to multiple agents.In this work, we focus on one fundamental class: two-player zero-sum Markov games.In this setting, there are many centralized algorithms developed in a line of recent works withnear-optimal sample complexity for ﬁnding a Nash equilibrium (Wei et al., 2017; Sidford et al.,2020; Xie et al., 2020; Bai and Jin, 2020; Zhang et al., 2020; Liu et al., 2020b). These algorithmsrequire a central controller that collects some global knowledge (such as the actions and the rewardsof all players) and then jointly decides the policies for all players. Centralized algorithms are usually convergent (as deﬁned in (Bowling and Veloso, 2001)), in the sense that the policies of the playersconverge to the set of Nash equilibria.On the other hand, there is also a surge of studies on decentralized algorithms that run inde-pendently on each player, requiring only local information such as the player’s own action and thecorresponding reward feedback (Zhang et al., 2019b; Bai et al., 2020; Tian et al., 2020; Liu et al.,2020a; Daskalakis et al., 2020). Compared to centralized ones, decentralized algorithms are usuallymore versatile and can potentially run in different environments (cooperative or competitive). Manyof them enjoy the property of being rational (as deﬁned in (Bowling and Veloso, 2001)), in the * Equal contribution. © C.-Y. Wei, C.-W. Lee, M. Zhang & H. Luo.

AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES sense that a player’s policy converges to the best response to the opponent no matter what stationarypolicy the opponent uses. However, it is also often more challenging to show the convergence to aNash equilibrium when the two players execute the same decentralized algorithm.Our main contribution is to develop the ﬁrst decentralized algorithm that is simultaneously rational , last-iterate convergent (with a concrete ﬁnite-time guarantee), agnostic , and symmet-ric (more details to follow in Section 1.1). Our algorithm is based on Optimistic Gradient De-scent/Ascent (OGDA) (Chiang et al., 2012; Rakhlin and Sridharan, 2013) and importantly relies ona critic that slowly learns a certain value function for each state. Following previous works onlearning MDPs (Abbasi-Yadkori et al., 2019; Agarwal et al., 2019) or Markov games (Perolat et al.,2018), we present the convergence guarantee in terms of the number of iterations of the algorithmand the estimation error of some gradient information (along with other problem-dependent con-stants), where the estimation error can be zero in a full-information setting, or goes down to zerofast enough with additional structural assumptions (e.g. every stationary policy pair induces anirreducible Markov chain, similar to (Auer and Ortner, 2007)).While the OGDA algorithm, ﬁrst studied in (Popov, 1980) under a different name, has beenextensively used in recent years for learning matrix games (a special case of Markov games withone state), to the best of our knowledge, no previous work has applied it to learning Markov gamesand derived a concrete last-iterate convergence rate. Several recent works derive last-iterate con-vergence of OGDA for matrix games (Hsieh et al., 2019; Liang and Stokes, 2019; Mokhtari et al.,2019; Golowich et al., 2020; Wei et al., 2021), and our analysis is heavily inspired by the approachof (Wei et al., 2021). However, the extension to Markov games is highly non-trivial as there isadditional “instability penalty” in the system that we need to handle; see Section 4 for detaileddiscussions. In this section, we discuss and compare related works on learning two-player zero-sum Markovgames. We refer the readers to a thorough survey by (Zhang et al., 2019a) for other topics in multi-agent reinforcement learning.Shapley (1953) ﬁrst introduces the Markov game model and proposes an algorithm analogous tovalue iteration for solving two-player zero-sum Markov games (with all parameters known). Later,Hoffman and Karp (1966) propose a policy iteration algorithm, and Pollatschek and Avi-Itzhak (1969)propose another policy iteration variant that works better in practice but cannot always converge.With the efforts of Van Der Wal (1978) and Filar and Tolwinski (1991), a slight variant of the(Pollatschek and Avi-Itzhak, 1969) algorithm is proposed in (Filar and Tolwinski, 1991) and provento converge. In such a full-information setting where all parameters are know, our algorithm has noestimation error and can also be viewed as a new policy-iteration algorithm.Littman (1994) initiates the study of competitive reinforcement learning under the frameworkof Markov games and proposes an extension of the single-player Q-learning algorithm, calledminimax-Q, which is later proven to converge under some conditions (Szepesv´ari and Littman,1999). While minimax-Q can run in a decentralized manner, it is conservative and only convergesto the minimax policy but not the best response to the opponent.To ﬁx this issue, the work of Bowling and Veloso (2001) argues that a desirable multi-agentlearning algorithm should have the following two properties simultaneously: rational and conver-gent . By their deﬁnition, a rational algorithm converges to its opponent’s best response if the oppo- AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES nent converges to a stationary policy, while a convergent algorithm converges to a Nash equilibriumif both agents use it. They propose the WoLF (Win-or-Learn-Fast) algorithm to achieve this goal,albeit only with empirical evidence. Subsequently, Conitzer and Sandholm (2007); Perolat et al.(2018); Sayin et al. (2020) design decentralized algorithms that provably enjoy these two proper-ties, but only with asymptotic guarantees.Recently, there is a surge of works that provide ﬁnite-time guarantees and characterize the tightsample complexity for ﬁnding Nash equilibria (Perolat et al., 2015; P´erolat et al., 2016; Wei et al.,2017; Sidford et al., 2020; Xie et al., 2020; Zhang et al., 2020; Bai and Jin, 2020; Liu et al., 2020b).These algorithms are all essentially centralized. Below, we focus on comparisons with several recentworks that propose decentralized algorithms and provide ﬁnite-time guarantees. Comparison with R-Max (Brafman and Tennenholtz, 2002), UCSG-online (Wei et al., 2017)and OMNI-VI-online (Xie et al., 2020)

These algorithms, like minimax-Q, converge to the min-imax policy instead of the best response to the opponent, even when the opponent is weak (i.e., notusing its best policy). In other words, these algorithms are not rational. Another drawback of thesealgorithms is that the learner has to observe the actions taken by the opponent. Our algorithm, onthe other hand, is both rational and agnostic to what the opponent plays.

Comparison with Optimistic Nash V-Learning (Bai et al., 2020; Tian et al., 2020)

The NashV-Learning algorithm handles the ﬁnite-horizon tabular case. It runs an exponential-weight algo-rithm on each state, with importance-weighted loss/reward estimators. Since the exponential-weightalgorithm is known to diverge even in matrix games (Bailey and Piliouras, 2018), the iterate of NashV-Learning also diverges. After training, however, Nash V-Learning can output a near-optimal non-Markovian policy with size linear in the training time. In contrast, our algorithm exhibits last-iterateconvergence, and the output is a simple Markovian policy.

Comparison with Smooth-FSP (Liu et al., 2020a)

The Smooth-FSP algorithm handles the func-tion approximation setting. The objective function it optimizes is the original objective plus an en-tropy regularization term. Because of this additional regularization, the players are only guaranteedto converge to some neighborhood of the minimax policy pair (with a constant radius), even whentheir gradient estimation error is zero. In contrast, our algorithm converges to the true minimaxpolicy pair when the gradient estimation error goes to zero.

Comparison with Independent PG (Daskalakis et al., 2020)

This work studies independentpolicy gradient in the tabular case. To achieve last-iterate convergence, the two players have touse asymmetric learning rates, and only the one with a smaller learning rate converges to the min-imax policy. In contrast, the two players of our algorithm are completely symmetric, and theysimultaneously converge to the equilibrium set.

2. Preliminaries

We consider a two-player zero-sum discounted Markov game deﬁned by a tuple ( S , A , B , σ, p, γ ) ,where: 1) S is a ﬁnite state space; 2) A and B are ﬁnite action spaces for Player 1 and Player2 respectively; 3) σ is the loss (payoff) function for Player 1 (Player 2), with σ ( s, a, b ) ∈ [0 , specifying how much Player 1 pays to Player 2 if they are at state s and select actions a and b

1. It is tempting to consider an even stronger rationality notion, that is, having no regret against an arbitrary opponent.This is, however, known to be computationally hard (Radanovic et al., 2019; Bai et al., 2020). AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES respectively; 4) p : S×A×B → ∆ S is the transition function, with p ( s ′ | s, a, b ) being the probabilityof transitioning to state s ′ after actions a and b are taken by the two players respectively at state s ( ∆ S denotes the set of probability distributions over S ); 5) and ≤ γ < is a discount factor. A stationary policy of Player 1 can be described by a function

S → ∆ A that maps each state toan action distribution. We use x s ∈ ∆ A to denote the action distribution for Player 1 on state s , anduse x = { x s } s ∈S to denote the complete policy. We deﬁne y s and y = { y s } s ∈S similarly for Player2. For notational convenience, we further deﬁne z s = ( x s , y s ) ∈ ∆ A × ∆ B as the concatenatedpolicy of the players on state s , and let z = { z s } s ∈S .For a pair of stationary policies ( x, y ) and an initial state s , the expected discounted value thatthe players pay/gain can be represented as V sx,y = E " ∞ X t =1 γ t − σ ( s t , a t , b t ) (cid:12)(cid:12)(cid:12)(cid:12) s = s, a t ∼ x s t , b t ∼ y s t , s t +1 ∼ p ( ·| s t , a t , b t ) , ∀ t ≥ . The minimax game value on state s is then deﬁned as V s⋆ = min x max y V sx,y = max y min x V sx,y . It is known that a pair of stationary policies ( x ⋆ , y ⋆ ) attaining the minimax value on state s isnecessarily attaining the minimax value on all states (Filar and Vrieze, 2012), and we call such x ⋆ a minimax policy, such y ⋆ a maximin policy, and such pair a Nash equilibrium. Further deﬁne X s⋆ = { x s⋆ ∈ x ⋆ : x ⋆ is a minimax policy } and similarly Y s⋆ = { y s⋆ ∈ y ⋆ : y ⋆ is a maximin policy } ,and denote Z s⋆ = X s⋆ × Y s⋆ . It is also known that any x = { x s } s ∈S with x s ∈ X s⋆ for all s is aminimax policy (similarly for y ) (Filar and Vrieze, 2012).For any x s , we denote its distance from X s⋆ as dist ⋆ ( x s ) = min x s⋆ ∈X s⋆ k x s⋆ − x s k , where k v k fora vector v denotes its L norm throughout the paper; similarly, dist ⋆ ( y s ) = min y s⋆ ∈Y s⋆ k y s⋆ − y s k and dist ⋆ ( z s ) = min z s⋆ ∈Z s⋆ k z s⋆ − z s k = q dist ⋆ ( x s ) + dist ⋆ ( y s ) . The projection operator for aconvex set U is deﬁned as Π U { v } = argmin u ∈U k u − v k .We also deﬁne the Q-function on state s under policy pair ( x, y ) as Q sx,y ( a, b ) = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ x,y i , which can be compactly written as a matrix Q sx,y ∈ R |A|×|B| such that V sx,y = x s ⊤ Q sx,y y s . We write Q s⋆ = Q sx ⋆ ,y ⋆ for any minimax/maximin policy pair ( x ⋆ , y ⋆ ) (which is unique even if ( x ⋆ , y ⋆ ) isnot). Finally, k Q k for a matrix Q is deﬁned as max i,j | Q i,j | . Optimistic Gradient Descent Ascent (OGDA)

As mentioned, our algorithm is based on runningan instance of the OGDA algorithm on each state with an appropriate loss/reward function. Tothis end, here, following the exposition of (Wei et al., 2021) we brieﬂy review OGDA for a matrixgame deﬁned by a matrix Q ∈ R |A|×|B| . Speciﬁcally, OGDA maintains two sequences of actiondistributions b x , b x , . . . ∈ ∆ A and x , x , . . . ∈ ∆ A for Player 1, and similarly two sequences b y , b y , . . . ∈ ∆ B and y , y , . . . ∈ ∆ B for Player 2, following the updates below: b x t +1 = Π ∆ A (cid:8)b x t − ηQy t (cid:9) , x t +1 = Π ∆ A (cid:8)b x t +1 − ηQy t (cid:9) , b y t +1 = Π ∆ B (cid:8)b y t + ηQ ⊤ x t (cid:9) , y t +1 = Π ∆ B (cid:8)b y t +1 + ηQ ⊤ x t (cid:9) , (1)

2. The discount factor is usually some value close to , so we assume that it is no less than for simplicity.3. Note the slight abuse of notation here: the meaning of dist ⋆ ( · ) depends on its input. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES where η is some learning rate. As one can see, unlike the standard Gradient Descent Ascent al-gorithm which simply sets ( x t , y t ) = ( b x t , b y t ) , OGDA takes a further descent/ascent step usingthe latest gradient to obtain ( x t , y t ) , which is then used to evaluate the gradient (of the function f ( x, y ) = x ⊤ Qy ). Wei et al. (2021) prove that the iterate ( b x t , b y t ) (or ( x t , y t ) ) converges to the setof Nash equilibria of the matrix game at a linear rate, which motivates us to generalize it to Markovgames. As we show in the following sections, however, the extensions of both the algorithm and theanalysis are highly non-trivial.We remark that while Wei et al. (2021) also analyze the last-iterate convergence of another al-gorithm called Optimistic Multiplicative Weight Update (OMWU), which is even more commonlyused in ﬁnite-action games, they also show that the theoretical guarantees of OMWU hold undermore limited assumptions (e.g., requiring the uniqueness of the equilibrium), and its empirical per-formance is also inferior to that of OGDA. We therefore only extend the latter to Markov games.

3. Algorithm and Main Results

A natural idea to extend OGDA to Markov games is to run the same algorithm described in Section 2for each state s with the game matrix Q being Q sx t ,y t . However, an important difference is that nowthe game matrix is changing over time. Indeed, if the polices are changing rapidly for subsequentstates, the game matrix Q sx t ,y t will also be changing rapidly, which makes the update on state s highly unstable and in turn causes similar issues for previous states.To resolve this issue, we propose to have a critic slowly learn the value function for each state.Speciﬁcally, for each state s , the critic maintains a sequence of values V s = 0 , V s , V s , . . . . Duringiteration t , instead of using Q sx t ,y t as the game matrix for state s , we use Q st deﬁned via Q st ( a, b ) = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) [ V s ′ t − ] . Ideally, OGDA would then take the role of an actor and compute x st +1 and b x st +1 using the gradient Q st y st (and similarly y st +1 and b y st +1 using the gradient Q s ⊤ t x st ).Since such exact gradient information is often unknown, we only require the algorithm to come upwith estimations ℓ st and r st such that k ℓ st − Q st y st k ≤ ε and k r st − Q s ⊤ t x st k ≤ ε for some prespeciﬁederror ε (more discussions in Section 3.1). See updates Eq. (2)-Eq. (5) in Algorithm 1. Note thatsimilar to (Wei et al., 2021), we adopt a constant learning rate η (independent of the number ofiterations) in these updates.At the end of each iteration t , the critic then updates the value function via V st = (1 − α t ) V st − + α t ρ st , where ρ st is an estimation of x s ⊤ t Q st y st such that | ρ st − x s ⊤ t Q st y st | ≤ ε . To stabilize the gamematrix, we require the learning rate α t to decrease in t and go to zero. Most of our analysis isconducted under this general condition, and the ﬁnal convergence rate depends on the concreteform of α t , which we set to α t = H +1 H + t with H = − γ inspired by (Jin et al., 2018) (there could bea different choice leading to a better convergence though).Our main results are the following two theorems on the last-iterate convergence of Algorithm 1.

4. For simplicity, here we assume that the two players share the same estimator ρ st (and thus same V st and Q st ). However,our analysis works even if they maintain different versions of ρ st , as long as they are ε -close to x s ⊤ t Q st y st with respectto their own Q st . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Algorithm 1

Optimistic Gradient Descent/Ascent for Markov Games

Parameters : γ ∈ [ , , η ≤ q (1 − γ ) S , ε ∈ h , − γ i . Parameters : a non-increasing sequence { α t } Tt =1 that goes to zero. Initialization : ∀ s ∈ S , arbitrarily initialize b x s = x s ∈ ∆ A and b y s = y s ∈ ∆ B , and set V s ← . for t = 1 , . . . , T do For all s , deﬁne Q st ∈ R |A|×|B| as Q st ( a, b ) , σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t − i , and update b x st +1 = Π ∆ A nb x st − ηℓ st o , (2) x st +1 = Π ∆ A nb x st +1 − ηℓ st o , (3) b y st +1 = Π ∆ B nb y st + ηr st o , (4) y st +1 = Π ∆ B nb y st +1 + ηr st o , (5) V st = (1 − α t ) V st − + α t ρ st , (6)where ℓ st , r st , and ρ st are ε -approximations of Q st y st , Q s ⊤ t x st , and x s ⊤ t Q st y st respectively, such that k ℓ st − Q st y st k ≤ ε , k r st − Q s ⊤ t x st k ≤ ε , and | ρ st − x s ⊤ t Q st y st | ≤ ε . endTheorem 1 (Average duality-gap convergence) Algorithm 1 with the choice of α t = H +1 H + t where H = − γ guarantees T T X t =1 max s,x ′ ,y ′ (cid:16) V s b x t ,y ′ − V sx ′ , b y t (cid:17) = O |S| η (1 − γ ) r log TT + |S|√ ε √ η (1 − γ ) ! . Theorem 2 (Last-iterate convergence)

Algorithm 1 with the choice of α t = H +1 H + t where H = − γ guarantees with b z sT = ( b x sT , b y sT ) , |S| X s ∈S dist ⋆ ( b z sT ) = O (cid:18) |S| η C (1 − γ ) T + εηC (1 − γ ) (cid:19) , where C > is a problem-dependent constant (that always exists) satisfying: for all state s and allpolicy pair z = ( x, y ) , max x ′ ,y ′ ( x s Q s⋆ y ′ s − x ′ s Q s⋆ y s ) ≥ C dist ⋆ ( z s ) . Theorem 1 implies that max s,x ′ ,y ′ ( V s b x t ,y ′ − V sx ′ , b y t ) goes to zero (when both /T and ε go to zero),which in turn implies the convergence of b z t to the set of Nash equilibria, although without a concreterate. Theorem 2, on the other hand, shows a concrete ﬁnite-time convergence rate on the distance of b z sT from the equilibrium set, which goes down at the rate of /T up to the estimation error ε . Theproblem-dependent constant C is similar to the matrix game case analyzed in (Wei et al., 2021), as AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES we will discuss in Section 4. As far as we know, this is the ﬁrst symmetric algorithm with ﬁnite-timelast-iterate convergence for both players simultaneously.

In the full-information setting where all parameters of the Markov game are given, we can calculatethe exact value of Q st y st , Q s ⊤ t x st , and x s ⊤ t Q st y st , making ε = 0 . In this case, our algorithm isessentially a new policy-iteration style algorithm for solving Markov games. However, in a learningsetting where the parameters are unknown, the players need to estimate these quantities based onany feedback from the environments. Here, we discuss how to do so when the players only observetheir current state and their loss/reward after taking an action.Speciﬁcally, in iteration t of our algorithm and with ( x t , y t ) at hand, the two players interact witheach other for a sequence of L steps, following a mixed strategy with a certain amount of uniformexploration deﬁned via: e x st ( a ) = (cid:16) − ε ′ (cid:17) x st ( a ) + ε ′ |A| and e y st ( b ) = (cid:16) − ε ′ (cid:17) y st ( b ) + ε ′ |B| , where ε ′ = (1 − γ ) ε . This generates a sequence of observations { ( s i , a i , σ ( s i , a i , b i )) } Li =1 for Player 1and similarly a sequence of observations { ( s i , b i , σ ( s i , a i , b i )) } Li =1 for Player 2, where a i ∼ e x s i t , b i ∼ e y s i t , and s i +1 ∼ p ( ·| s i , a i , b i ) . Then we construct the estimators as follows: ℓ st ( a ) = P Li =1 [ s i = s, a i = a ] (cid:0) σ ( s, a, b i ) + γV s i +1 t − (cid:1)P Li =1 [ s i = s, a i = a ] , (7) r st ( b ) = P Li =1 [ s i = s, b i = b ] (cid:0) σ ( s, a i , b ) + γV s i +1 t − (cid:1)P Li =1 [ s i = s, b i = b ] , (8) ρ st = P Li =1 [ s i = s ] (cid:0) σ ( s, a i , b i ) + γV s i +1 t − (cid:1)P Li =1 [ s i = s ] . (9)(If any of the denominator is zero, deﬁne the corresponding estimator as zero.) To make surethat these are accurate estimators for every state, we naturally need to ensure that every state isvisited often enough. To this end, we make the following assumption similar to (Auer and Ortner,2007), which essentially requires that the induced Markov chain under any stationary policy pair isirreducible. Assumption 1

There exists µ > such that µ = max x,y max s,s ′ T s → s ′ x,y , where T s → s ′ x,y is theexpected time to reach s ′ from s following the policy pair ( x, y ) . Under this assumption, the following theorem shows that taking L ≈ /ε is enough to ensurethe accuracy of the estimators (see Appendix H for the proof). Theorem 3

If Assumption 1 holds and L = e Ω (cid:16) |A| + |B| (1 − γ ) µε log ( T /δ ) (cid:17) , then the estimators Eq. (7) ,Eq. (8) , and Eq. (9) ensure that with probability at least − δ , k ℓ st − Q st y st k , k r st − Q s ⊤ t x st k , and | ρ st − x s ⊤ t Q st y st | are all of order O ( ε ) for all t . Together with Theorem 1 and Theorem 2, given a ﬁxed number of interactions between theplayers, we can now determine optimally how many iterations we should run our algorithm (and

5. We use e Ω to hide logarithmic factors except for T and /δ . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES consequently how large we should set ε ). Equivalently, we show below how many iterations or totalinteractions are need to achieve a certain accuracy. (The choice of α t is the same as in Theorem 1and Theorem 2.) Corollary 4

If Assumption 1 holds, then running Algorithm 1 with estimators Eq. (7) , Eq. (8) ,Eq. (9) and L = e Ω (cid:16) ( |A| + |B| ) |S| (1 − γ ) µη ξ log ( T /δ ) (cid:17) for T = e Ω (cid:16) |S| η (1 − γ ) ξ (cid:17) iterations ensures withprobability at least − δ , T P Tt =1 max s,x ′ ,y ′ ( V s b x t ,y ′ − V sx ′ , b y t ) ≤ ξ . Ignoring other dependence, thisrequires e Ω(1 /ξ ) interactions in total. Corollary 5

If Assumption 1 holds, then running Algorithm 1 with estimators Eq. (7) , Eq. (8) ,Eq. (9) and L = e Ω (cid:16) |A| + |B| (1 − γ ) µη C ξ log ( T /δ ) (cid:17) for T = Ω (cid:16) |S| η C (1 − γ ) ξ (cid:17) iterations ensures withprobability at least − δ , |S| P s ∈S dist ⋆ ( b z sT ) ≤ ξ . Ignoring other dependence, this requires e Ω(1 /ξ ) interactions in total. Finally, we argue that from the perspective of a single player (take Player 1 as an example), ouralgorithm is also rational, in the sense that it allows Player 1 to converge to the best response to heropponent if Player 2 is not applying our algorithm but instead uses an arbitrary stationary policy. We show this single-player-perspective version in Algorithm 2, where Player 1 still follows theupdates Eq. (2), Eq. (3), and Eq. (6), while y t is ﬁxed to a stationary policy y used by Player 2.In fact, thanks to the agnostic nature of our algorithm, rationality is essentially an implicationof the convergence property. To see this, consider a modiﬁed two-player Markov game with thedifference being that the opponent has only a single action (call it ) on each state, the loss functionis redeﬁned as σ ( s, a,

1) = E b ∼ y s [ σ ( s, a, b )] , and the transition kernel is redeﬁned as p ( s ′ | s, a,

1) = E b ∼ y s [ p ( s ′ | s, a, b )] . It is straightforward to see that following our algorithm, Player 1’s behaviors inthe original game and in the modiﬁed game are exactly the same. On the other hand, in the modiﬁedgame, since Player 2 has only one action (and thus one strategy), she can also be seen as using ouralgorithm. Therefore, we can apply our convergent guarantees to the modiﬁed game, and since theminimax policy in the modiﬁed game is exactly the best response in the original game, we knowthat Player 1 indeed converges to the best response. We summarize these rationality guarantees inthe following theorem, with the formal proof deferred to Appendix I. Theorem 6

Algorithm 2 with the choice of α t = H +1 H + t where H = − γ guarantees T T X t =1 max s,x ′ (cid:16) V s b x t ,y s − V sx ′ ,y s (cid:17) = O |S| η (1 − γ ) r log TT + |S|√ ε √ η (1 − γ ) ! , and for X BR = n x : V sx,y = min x ′ V sx ′ ,y , ∀ s ∈ S o and some problem-dependent constant C ′ > , |S| X s ∈S k b x sT − Π X BR { b x sT }k = O (cid:18) |S| η C ′ (1 − γ ) T + εηC ′ (1 − γ ) (cid:19) .

6. The rationality deﬁned by Bowling and Veloso (2001) requires that the learner converges to the best response as longas the opponent converges to a stationary policy. While our algorithm does handle this case, as a proof of concept,we only consider the simpler scenario where the opponent simply uses a stationary policy. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

4. Analysis Overview

In this section, we give an overview of how we analyze Algorithm 1 and prove Theorem 1 andTheorem 2. We start by giving a quick review of the analysis of (Wei et al., 2021) for matrix games,and then highlight how we overcome the challenges when generalizing it to Markov games.

Review for matrix games

Recall the update in Eq. (1) for a ﬁxed matrix Q . Wei et al. (2021)show the following two convergence guarantees:1. Average duality-gap convergence: T T X t =1 ∆( b z t ) = O (cid:18) η √ T (cid:19) (10)where ∆( z ) = max x ′ ,y ′ (cid:0) x ⊤ Qy ′ − x ′⊤ Qy (cid:1) is the duality gap of z = ( x, y ) .2. Last-iterate convergence: dist ⋆ ( b z t ) ≤ C dist ⋆ ( b z ) (cid:0) η C (cid:1) − t (11)where dist ⋆ ( z ) is the distance from z to the set of equilibria, C is a universal constant, and C > is a positive constant that depends on Q . The analysis of (Wei et al., 2021) starts from the following single-step inequality that follows thestandard Online Mirror Descent analysis and describes the relation between dist ⋆ ( b z t +1 ) and dist ⋆ ( b z t ) : dist ⋆ ( b z t +1 ) ≤ dist ⋆ ( b z t ) + η k z t − z t − k | {z } instability penalty − (cid:16) k b z t +1 − z t k + k z t − b z t k (cid:17)| {z } instability bonus . (12)The instability penalty term makes dist ⋆ ( b z t +1 ) larger if k z t − z t − k is large, while the instabilitybonus term makes dist ⋆ ( b z t +1 ) smaller if either k b z t +1 − z t k or k z t − b z t k is large. To obtain Eq. (10),Wei et al. (2021) make the observation that the instability bonus term is lower bounded by a constanttimes the squared duality gap of b z t +1 , that is, k b z t +1 − z t k + k z t − b z t k & η ∆ ( b z t +1 ) , and thus dist ⋆ ( b z t +1 ) ≤ dist ⋆ ( b z t ) + η k z t − z t − k | {z } instability penalty − (cid:16) k b z t +1 − z t k + k z t − b z t k (cid:17)| {z } instability bonus − Ω( η ∆ ( b z t +1 )) . (13)By taking η ≤ , summing over t , canceling the penalty term with the bonus term, telescoping andrearranging, we get P Tt =1 ∆ ( b z t ) ≤ O (1 /η ) . An application of Cauchy-Schwarz inequality thenproves Eq. (10).To further obtain Eq. (11), Wei et al. (2021) prove that there exists some problem-dependentconstant C > such that for all z , ∆( z ) ≥ C dist ⋆ ( z ) . This, when combined with Eq. (13), shows dist ⋆ ( b z t +1 ) ≤ dist ⋆ ( b z t )1 + Ω( η C ) + η k z t − z t − k − Ω (cid:16) k b z t +1 − z t k + k z t − b z t k (cid:17) . (14)

7. This is not to be confused with the constant C in Theorem 2. We overload the notation because they indeed play thesame role in the analysis. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

By upper bounding k z t − z t − k ≤ k z t − b z t k + 2 k b z t − z t − k and rearranging, they further obtain: dist ⋆ ( b z t +1 ) + c ′ k b z t +1 − z t k + c ′ k z t − b z t k ≤ dist ⋆ ( b z t ) + c ′ k b z t − z t − k + c ′ k z t − − b z t − k η C ) (15)for some universal constant c ′ , which clearly indicates the linear convergence of dist ⋆ ( b z t ) and henceproves Eq. (11). Overview of our proofs

We are now ready to show the high-level ideas of our analysis. Forsimplicity, we consider the case with ε = 0 and also assume that there is a unique equilibrium ( x ⋆ , y ⋆ ) (these assumptions are removed in the formal proofs). Our analysis follows the steps below. Step 1 (Appendix B)

Similar to Eq. (12), we conduct a single-step analysis for OGDA in Markovgames (Lemma 24), which shows for all state s : dist ⋆ ( b z st +1 ) ≤ dist ⋆ ( b z st ) + η (cid:13)(cid:13) z st − z st − (cid:13)(cid:13) − (cid:16)(cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k (cid:17) + 8 η (cid:13)(cid:13) Q st − Q st +1 (cid:13)(cid:13) + 4 η k Q st − Q s⋆ k . (16)Comparing this with Eq. (12), we see that, importantly, since the game matrix Q st is changing overtime, we have two extra instability penalty terms: η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) and η k Q st − Q s⋆ k . Our hope isto further upper bound these two penalty terms by something related to k z st − z st +1 k , so that theycan again be canceled by the bonus term − ( (cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k ) . Indeed, in Steps 3-5, weshow that part of them can be bounded by a weighted sum of {k z s ′ τ − z s ′ τ +1 k } s ′ ∈S ,τ ≤ t . Step 2 (Appendix C): Lower bounding (cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k . As in Eq. (13), we aim tolower bound the instability bonus term by the duality gap. However, since the updates are based on Q st instead of Q s⋆ , we can only relate the bonus term to the duality gap with respect to Q st . To furtherrelate this to the duality gap with respect to Q s⋆ , we pay a quantity related to k Q st − Q s⋆ k . Formally,we show in Lemma 25: (cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k & Ω( η ∆ ( b z st +1 )) − O ( η k Q st − Q s⋆ k ) , where ∆( z s ) , max x ′ s ,y ′ s ( x s Q s⋆ y ′ s − x ′ s Q s⋆ y s ) is the duality gap on state s with respect to Q s⋆ . Step 3 (Appendix D): Upper bounding (cid:13)(cid:13) Q st +1 − Q st (cid:13)(cid:13) . (cid:13)(cid:13) Q st +1 − Q st (cid:13)(cid:13) is upper bounded by γ max s ′ ( V s ′ t − V s ′ t − ) by the deﬁnition of Q st . Furthermore, V s ′ t − V s ′ t − is a weighted sum of { ρ s ′ τ − ρ s ′ τ − } t − τ =1 by the deﬁnition of V s ′ t , and also ρ s ′ τ − ρ s ′ τ − = x s ′ τ Q s ′ τ y s ′ τ − x s ′ τ − Q s ′ τ − y s ′ τ − = O ( k z s ′ τ − z s ′ τ − k + k Q s ′ τ − Q s ′ τ − k ) . In sum, one can upper bound k Q st +1 − Q st k by a weighted sumof k z s ′ τ − z s ′ τ − k and k Q s ′ τ − Q s ′ τ − k . After formalizing the above relations, we obtain the followinginequality (see Lemma 28): (cid:13)(cid:13) Q st +1 − Q st (cid:13)(cid:13) ≤ max s ′ γ (1 − γ ) t X τ =1 α τt k z s ′ τ − z s ′ τ − k + max s ′ γ γ t X τ =1 α τt k Q s ′ τ − Q s ′ τ − k (17)for some coefﬁcient α τt deﬁned in Appendix A.2. With recursive expansion, the above implies that (cid:13)(cid:13) Q st +1 − Q st (cid:13)(cid:13) can be upper bounded by a weighted sum of k z s ′ τ − z s ′ τ − k for s ′ ∈ S and τ ≤ t .

8. Similar to the notation dist ⋆ ( · ) , we also omit writing the s dependence for the function ∆( · ) . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Step 4 (Appendix E): Upper bounding k Q st − Q s⋆ k (Part 1). We ﬁrst upper bound k Q st − Q s⋆ k with respect to the following weighted-regret quantityReg t , max s max ( t X τ =1 α τt ( x sτ − x s⋆ ) Q sτ y sτ , t X τ =1 α τt x sτ Q sτ ( y s⋆ − y sτ ) ) . To do so, we deﬁne Γ t = max s k Q st − Q s⋆ k and show for the same coefﬁcient α τt mentioned earlier, V st = t X τ =1 α τt ρ sτ = t X τ =1 α τt x sτ Q sτ y sτ ≤ t X τ =1 α τt x s⋆ Q sτ y sτ + Reg t ≤ t X τ =1 α τt x s⋆ Q s⋆ y sτ + t X τ =1 α τt Γ τ + Reg t ≤ t X τ =1 α τt x s⋆ Q s⋆ y s⋆ + t X τ =1 α τt Γ τ + Reg t = V s⋆ + t X τ =1 α τt Γ τ + Reg t where the last inequality is by the fact P tτ =1 α τt = 1 . Using the deﬁnition of Q st again, we thenhave Q st +1 ( a, b ) − Q s⋆ ( a, b ) = γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t − V s ′ ⋆ i ≤ γ ( P tτ =1 α τt Γ τ + Reg t ) . By the samereasoning, we can also show Q st +1 ( a, b ) − Q s⋆ ( a, b ) ≥ − γ ( P tτ =1 α τt Γ τ + Reg t ) , and therefore weobtain the following recursive relation (Lemma 29) Γ t +1 = max s k Q st +1 − Q s⋆ k ≤ γ t X τ =1 α τt Γ τ + Reg t ! . (18) Step 5 (Appendix E): Upper bounding k Q st − Q s⋆ k (Part 2). In this step, we further relate Reg t to {k z s ′ τ − z s ′ τ − k } τ ≤ t,s ′ ∈S . From a one-step regret analysis of OGDA, we have the following (forPlayer 1): ( x st − x s⋆ ) Q st y st ≤ η (cid:16) dist ⋆ ( b x st ) − dist ⋆ ( b x st +1 ) (cid:17) + 4 η (1 − γ ) k y st − y st − k + 4 η k Q st − Q st − k . Recall that Reg t is deﬁned via a weighted sum of the left-hand side above with weights α τt . There-fore, we take the weighted sum of the above and bound P tτ =1 α τt ( x sτ − x s⋆ ) Q sτ y sτ by α t dist ⋆ ( b x s )2 η + t X τ =1 α τt η (cid:0) dist ⋆ ( b x sτ ) − dist ⋆ ( b x sτ +1 ) (cid:1) + 4 η (1 − γ ) t X τ =1 α τt k y sτ − y sτ − k + 4 η t X τ =1 α τt k Q sτ − Q sτ − k ≤ η t X τ =1 α τt α τ − dist ⋆ ( b z sτ ) | {z } term + 4 η (1 − γ ) t X τ =1 α τt k z sτ − z sτ − k | {z } term + 4 η t X τ =1 α τt k Q sτ − Q sτ − k | {z } term (19)where in the inequality we rearrange the ﬁrst summation and use the fact α τt − α τ − t ≤ α τ − α τt (see the formal proof in Lemma 30). Since the case for P tτ =1 α τt x sτ Q sτ ( y s⋆ − y sτ ) is similar, by thedeﬁnition of Reg t , we conclude that Reg t is upper bounded by the maximum over s of the sum ofthe three terms in Eq. (19). Note that, term is itself a weighted sum of {k z sτ − z sτ − k } τ ≤ t , and term can also be upper bounded by a weighted sum of {k z s ′ τ − z s ′ τ − k } τ ≤ t,s ′ ∈S as we alreadyshowed in Step 3. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Combining all steps.

Summing up Eq. (16) over all s , and based on all earlier discussions, wehave X s dist ⋆ ( b z st +1 ) ≤ X s dist ⋆ ( b z st ) + t X τ =1 X s µ sτ α τ − dist ⋆ ( b z sτ ) | {z } term + t X τ =1 X s ν sτ k z sτ − z sτ − k | {z } term − X s (cid:16)(cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k (cid:17)| {z } term − Ω η X s ∆ ( b z st +1 ) ! (20)for some weights µ sτ and ν sτ (a large part of the analysis is devoted to precisely calculating theseweights). Here, the − Ω (cid:0) η P s ∆ ( b z st +1 ) (cid:1) term comes from Step 2; term is a weighted sum of { α τ − dist ⋆ ( b z s ′ τ ) } τ ≤ t,s ′ ∈S that comes from term in Step 5; term is a weighed sum of {k z s ′ τ − z s ′ τ − k } τ ≤ t,s ′ ∈S that comes from all other terms we discuss in Steps 3-5. Obtaining average duality-gap bound

To obtain the average duality-gap bound in Theorem 1,we sum Eq. (20) over t , and further argue that the sum of term over t is smaller than the sum of term over t (hence they are canceled with each other). Rearranging and telescoping leads to η T X t =1 X s ∆ ( b z st +1 ) = O T X t =1 t X τ =1 X s µ sτ α τ − dist ⋆ ( b z sτ ) ! = O T X t =1 t X τ =1 X s µ sτ α τ − ! . As long as α t is decreasing and going to zero, the right-hand side above can be shown to be sub-linear in T . Further relating max x ′ ,y ′ (cid:16) V s b x t ,y ′ − V sx ′ , b y t (cid:17) to ∆( b z st ) (Lemma 32) proves Theorem 1. Obtaining last-iterate convergence bound

Following the matrix game case, there is a problem-dependent constant

C > such that ∆( b z st +1 ) ≥ C dist ⋆ ( b z st +1 ) . Similarly to how Eq. (14) is ob-tained, we use this in Eq. (20) and arrive at X s dist ⋆ ( b z st +1 ) ≤

11 + Ω( η C ) X s dist ⋆ ( b z st ) + t X τ =1 X s µ sτ α τ − dist ⋆ ( b z sτ ) | {z } term + t X τ =1 X s ν sτ k z sτ − z sτ − k | {z } term − Ω X s (cid:16)(cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k (cid:17)| {z } term ! (21)Then ideally we would like to follow a similar argument from Eq. (14) to Eq. (15) to obtain a last-iterate convergence guarantee. However, we face two more challenges here. First, we have an extra term . Fortunately, this term vanishes when t is large as long as α t decreases and converges tozero. Second, in Eq. (14), the indices of the negative term k b z t +1 − z t k + k z t − b z t k and the positiveterm η k z t − z t − k are only offset by so that a simple rearrangement is enough to get Eq. (15),while in Eq. (21), the indices in term and term are far from each other. To address this issue, wefurther introduce a set of weights and consider a weighted sum of Eq. (21) over t . We then showthat the weighted sum of term can be canceled by the weighted sum of term . Combining theabove proves Theorem 2. Note that due to these extra terms, our last-iterate convergence rate isonly sublinear (while Eq. (11) shows a linear rate for matrix games). AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

5. Conclusion and Future Directions

In this work, we propose the ﬁrst decentralized algorithm for two-player zero-sum Markov gamesthat is rational, convergent, agnostic, symmetric, and having a ﬁnite-time convergence rate guaranteeat the same time. The algorithm is based on running OGDA on each state, together with a slowlychanging critic that stabilizes the game matrix on each state.Our work studies the most basic tabular setting, and also requires a structural assumption whenestimation is needed that sidesteps the difﬁculty of performing exploration over the state space.Important future directions include relaxing either of these assumptions, that is, extending ourframework to allow function approximation and/or incorporating efﬁcient exploration mechanisms.Studying OGDA-based algorithms beyond the two-player zero-sum setting is also an interestingfuture direction.

References

Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gell´ertWeisz. Politex: Regret bounds for policy iteration using expert prediction. In

InternationalConference on Machine Learning , 2019.Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradi-ent methods: Optimality, approximation, and distribution shift. arXiv preprint arXiv:1908.00261 ,2019.Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcementlearning. In B. Sch¨olkopf, J. Platt, and T. Hoffman, editors,

Advances in Neural InformationProcessing Systems , 2007.Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. arXivpreprint arXiv:2002.04017 , 2020.Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play.

Advancesin Neural Information Processing Systems , 2020.James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In

Proceedings of the 2018 ACM Conference on Economics and Computation , 2018.Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In

Proceedings of the 17th international joint conference on Artiﬁcial intelligence , 2001.Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning.

Journal of Machine Learning Research , 2002.Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, andShenghuo Zhu. Online optimization with gradual variations. In

Conference on Learning Theory ,2012.Vincent Conitzer and Tuomas Sandholm. Awesome: A general multiagent learning algorithm thatconverges in self-play and learns a best response against stationary opponents.

Machine Learning ,2007. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Constantinos Daskalakis, Dylan J. Foster, and Noah Golowich. Independent policy gradient meth-ods for competitive reinforcement learning. In

Advances in neural information processing sys-tems , 2020.Jerzy Filar and Koos Vrieze.

Competitive Markov decision processes . Springer Science & BusinessMedia, 2012.Jerzy A Filar and Boleslaw Tolwinski. On the algorithm of pollatschek and avi-ltzhak. 1991.Andrew Gilpin, Javier Pena, and Tuomas Sandholm. First-order algorithm with O (ln(1 /ǫ )) conver-gence for ǫ -equilibrium in two-person zero-sum games. Mathematical programming , 2012.Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence ratesfor no-regret learning in multi-player games.

Advances in neural information processing systems ,2020.Alan J Hoffman and Richard M Karp. On nonterminating stochastic games.

Management Science ,1966.Yu-Guan Hsieh, Franck Iutzeler, J´erˆome Malick, and Panayotis Mertikopoulos. On the convergenceof single-call stochastic extra-gradient methods. In

Advances in Neural Information ProcessingSystems , 2019.Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efﬁ-cient? In

Advances in Neural Information Processing Systems , 2018.Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local conver-gence of generative adversarial networks. In

The 22nd International Conference on ArtiﬁcialIntelligence and Statistics , 2019.Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In

Machine learning proceedings . 1994.Boyi Liu, Zhuoran Yang, and Zhaoran Wang. Policy optimization in zero-summarkov games: Fictitious self-play provably attains nash equilibria, 2020a. URL https://openreview.net/forum?id=c3MWGN_cTf .Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcementlearning with self-play. arXiv preprint arXiv:2010.01604 , 2020b.Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A uniﬁed analysis of extra-gradient andoptimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprintarXiv:1901.08511 , 2019.Julien Perolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programmingfor two-player zero-sum markov games. In

International Conference on Machine Learning , 2015.Julien P´erolat, Bilal Piot, Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. Softened approx-imate policy iteration for markov games. In

International Conference on Machine Learning ,2016. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Julien Perolat, Bilal Piot, and Olivier Pietquin. Actor-critic ﬁctitious play in simultaneous movemultistage games. In

International Conference on Artiﬁcial Intelligence and Statistics , 2018.MA Pollatschek and B Avi-Itzhak. Algorithms for stochastic games with geometrical interpretation.

Management Science , 1969.Leonid Denisovich Popov. A modiﬁcation of the arrow-hurwicz method for search of saddle points.

Mathematical notes of the Academy of Sciences of the USSR , 1980.Goran Radanovic, Rati Devidze, David Parkes, and Adish Singla. Learning to collaborate in markovdecision processes. In

International Conference on Machine Learning , 2019.Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable se-quences. In

Advances in Neural Information Processing Systems , 2013.Muhammed O Sayin, Francesca Parise, and Asuman Ozdaglar. Fictitious play in zero-sum stochas-tic games. arXiv preprint arXiv:2010.04223 , 2020.Lloyd S Shapley. Stochastic games.

Proceedings of the national academy of sciences , 1953.Aaron Sidford, Mengdi Wang, Lin Yang, and Yinyu Ye. Solving discounted stochastic two-playergames with near-optimal time and sample complexity. In

International Conference on ArtiﬁcialIntelligence and Statistics , 2020.David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of gowithout human knowledge.

Nature , 2017.Csaba Szepesv´ari and Michael L Littman. A uniﬁed analysis of value-function-based reinforcement-learning algorithms.

Neural computation , 1999.Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Provably efﬁcient online agnostic learningin markov games. arXiv preprint arXiv:2010.15020 , 2020.J Van Der Wal. Discounted markov games: Generalized policy iteration method.

Journal of Opti-mization Theory and Applications , 1978.Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Micha¨el Mathieu, Andrew Dudzik, Juny-oung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmasterlevel in starcraft ii using multi-agent reinforcement learning.

Nature , 2019.Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In

Advances in Neural Information Processing Systems , 2017.Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergencein constrained saddle-point optimization.

International Conference on Learning Representations ,2021.Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. arXiv preprintarXiv:2002.07066 , 2020. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Kaiqing Zhang, Zhuoran Yang, and Tamer Bas¸ar. Multi-agent reinforcement learning: A selectiveoverview of theories and algorithms. arXiv preprint arXiv:1911.10635 , 2019a.Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Policy optimization provably converges to nashequilibria in zero-sum linear quadratic games. In

Advances in Neural Information ProcessingSystems , 2019b.Kaiqing Zhang, Sham M Kakade, Tamer Bas¸ar, and Lin F Yang. Model-based multi-agent rl inzero-sum markov games with near-optimal sample complexity. arXiv preprint arXiv:2007.07461 ,2020.

Appendix A. Notations

A.1. Simpliﬁcations of the Notations

We deﬁne the following notations to simplify the proofs:

Deﬁnition 7 b x s = x s = |A| (zero vector with dimension |A| ), b y s = y s = |B| , Q s = |A|×|B| , ℓ s = |A| , r s = |B| , ρ s = 0 , α = 1 . Besides, for a matrix Q , we deﬁne k Q k = max i,j | Q ij | . To avoid cluttered notation, a product ofthe form x ⊤ Qy is usually simply written as xQy . A.2. Auxiliary Coefﬁcients

In this subsection, we deﬁne several coefﬁcients that are related to the value learning rate { α t } . Deﬁnition 8 ( α τt ) For non-negative integers τ and t with τ ≤ t , deﬁne α τt = α τ Q ti = τ +1 (1 − α i ) . Deﬁnition 9 ( δ τt ) For non-negative integers τ and t with τ ≤ t , deﬁne δ τt , Q ti = τ +1 (1 − α i ) . Deﬁnition 10 ( β τt ) For positive integers τ and t with τ < t , deﬁne β τt = α τ Q t − i = τ (1 − α i + α i γ ) .Deﬁne β tt = 1 . Deﬁnition 11 ( λ t ) For positive integers t , deﬁne λ t = max n α t +1 α t , − α t (1 − γ )2 o . Deﬁnition 12 ( λ τt ) For positive integers τ and t with τ < t , deﬁne λ τt = α τ Q t − i = τ λ i . Deﬁne λ tt = 1 . A.3. Auxiliary Variables

In this subsection, we deﬁne several auxiliary variables to be used in the later analysis.

Deﬁnition 13 ( J st ) For every state s ∈ S , deﬁne the sequence { J st } t =1 , ,... by J s = k z s − z s k ,J st = (1 − α t ) J st − + α t (cid:13)(cid:13) z st − z st − (cid:13)(cid:13) , ∀ t ≥ . Furthermore, deﬁne J t , max s J st . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Deﬁnition 14 ( K st ) For every state s ∈ S , deﬁne the sequence { K st } t =1 , ,... by K s = k Q s − Q s k ,K st = (1 − α t ) K st − + α t (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) , ∀ t ≥ . Furthermore, deﬁne K t , max s K st . Deﬁnition 15 ( b x st⋆ , b y st⋆ , b z st⋆ ) Deﬁne b x st⋆ = Π X s⋆ ( b x st ) , i.e., the projection of b x st onto the set of optimalpolicy X s⋆ on state s . Similarly, b y st⋆ = Π Y s⋆ ( b y st ) , and b z st⋆ = Π Z s⋆ ( b z st ) = ( b x st⋆ , b y st⋆ ) . Deﬁnition 16 ( ∆ st ) Deﬁne ∆ st = max x ′ ,y ′ ( b x st Q s⋆ y ′ s − x ′ s Q s⋆ b y st ) for all t ≥ . Deﬁnition 17 (Reg st ) Deﬁne

Reg st = max ( t X τ =1 α τt ( x sτ − b x st⋆ ) Q sτ y sτ , t X τ =1 α τt x sτ Q sτ ( b y st⋆ − y sτ ) ) and Reg t = max s Reg st . Deﬁnition 18 ( Γ t ) Deﬁne Γ t = max s k Q st − Q s⋆ k . Deﬁnition 19 ( θ st ) Deﬁne θ st = k b z st − z st − k + k z st − − b z st − k Deﬁnition 20 ( Z t ) Deﬁne Z t = max s P tτ =1 α τt α τ − dist ⋆ ( b z sτ ) . A.4. Assumptions on α t and Simple Facts about α τt We require α t to satisfy the following:• α = 1 • < α t +1 ≤ α t ≤ • α t → as t → ∞ Furthermore, α , . Below is an useful lemma that is used in many places: Lemma 21 If { h t } t =0 , , ,... and { k t } t =1 , ,... are non-negative sequences that satisfy h t = (1 − α t ) h t − + α t k t for t ≥ , then h t = P tτ =1 α τt k τ . Proof

We prove it by induction. When t = 1 , since α = 1 , h = k = α k . Assume that theformula is correct for h t . Then h t +1 = (1 − α t +1 ) h t + α t +1 k t +1 = (1 − α t +1 ) t X τ =1 α τt k τ + α t +1 t +1 k t +1 = t X τ =1 α τt +1 k τ + α t +1 t +1 k t +1 = t +1 X τ =1 α τt +1 k τ . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Corollary 22

The following hold: • V st = P tτ =1 α τt ρ sτ • J st = P tτ =1 α τt (cid:13)(cid:13) z sτ − z sτ − (cid:13)(cid:13) • K st = P tτ =1 α τt (cid:13)(cid:13) Q sτ − Q sτ − (cid:13)(cid:13) Proof

They immediately follow from Lemma 21 and the deﬁnition of J st , K st , V st . Appendix B. Proof for Step 1: Single-Step Inequality

Lemma 23

For any state s and t , ( x st − b x st⋆ ) Q st y st ≤ η (cid:16) dist ⋆ ( b x st ) − dist ⋆ ( b x st +1 ) − k b x st +1 − x st k − k x st − b x st k (cid:17) + 4 η (1 − γ ) k y st − y st − k + 4 η k Q st − Q st − k + 3 ε,x st Q st ( b y st⋆ − y st ) ≤ η (cid:16) dist ⋆ ( b y st ) − dist ⋆ ( b y st +1 ) − k b y st +1 − y st k − k y st − b y st k (cid:17) + 4 η (1 − γ ) k x st − x st − k + 4 η k Q st − Q st − k + 3 ε. Proof

By standard proof of OGDA (see, e.g., the proof of Lemma 1 in (Wei et al., 2021) or Lemma1 in (Rakhlin and Sridharan, 2013)), we have ( x st − b x st⋆ ) ⊤ ℓ st ≤ η (cid:16) k b x st − b x st⋆ k − (cid:13)(cid:13)b x st +1 − b x st⋆ (cid:13)(cid:13) − k b x st +1 − x st k − k x st − b x st k (cid:17) + η k ℓ st − ℓ st − k . Since k b x st − b x st⋆ k = dist ⋆ ( b x st ) and (cid:13)(cid:13)b x st +1 − b x st⋆ (cid:13)(cid:13) ≥ dist ⋆ ( b x st +1 ) by the deﬁnition of dist ⋆ ( · ) , wefurther have ( x st − b x st⋆ ) ⊤ ℓ st ≤ η (cid:16) dist ⋆ ( b x st ) − dist ⋆ ( b x st +1 ) − k b x st +1 − x st k − k x st − b x st k (cid:17) + η k ℓ st − ℓ st − k . (22)By the deﬁnition of ℓ st , we have η (cid:13)(cid:13) ℓ st − ℓ st − (cid:13)(cid:13) ≤ η (cid:13)(cid:13) ℓ st − Q st y st + ( Q st − Q st − ) y st + Q st − ( y st − y st − ) + Q st − y st − − ℓ st − (cid:13)(cid:13) ≤ η k ℓ st − Q st y st k + 4 η (cid:13)(cid:13) ( Q st − Q st − ) y st (cid:13)(cid:13) + 4 η (cid:13)(cid:13) Q st − ( y st − y st − ) (cid:13)(cid:13) + 4 η (cid:13)(cid:13) Q st − y st − − ℓ st − (cid:13)(cid:13) ≤ η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) + 4 η (1 − γ ) (cid:13)(cid:13) y st − y st − (cid:13)(cid:13) + 8 ηε and ( x st − b x st⋆ ) Q st y st ≤ ( x st − b x st⋆ ) ℓ st + 2 ε. Combining them with Eq. (22) and the fact that ηε ≤ η − γ ≤ , we get the ﬁrst inequality that wewant to prove. The other inequality is similar. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Lemma 24

For all t ≥ , dist ⋆ ( b z st +1 ) ≤ dist ⋆ ( b z st ) − θ st +1 + θ st + 4 η Γ t + 8 η k Q st − Q st − k + 6 ηε. Proof

Summing up the two inequalities in Lemma 23, we get η ( x st − b x st⋆ ) Q st y st + 2 ηx st Q st ( b y st⋆ − y st ) ≤ dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) + 4 η (1 − γ ) k z st − z st − k + 8 η k Q st − Q st − k − k b z st +1 − z st k − k z st − b z st k + 6 ηε ≤ dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) + 132 k z st − z st − k + 8 η k Q st − Q st − k − k b z st +1 − z st k − k z st − b z st k + 6 ηε ≤ dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) + 116 (cid:0) k z st − b z st k + k b z st − z st − k (cid:1) + 8 η k Q st − Q st − k − k b z st +1 − z st k − k z st − b z st k + 6 ηε = dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) + 8 η k Q st − Q st − k − k z st − b z st k − k b z st +1 − z st k + 116 k b z st − z st − k + 6 ηε The left-hand side above, can be lower bounded by η ( x st − b x st⋆ ) Q st y st + 2 ηx st Q st ( b y st⋆ − y st ) = 2 ηx st Q st b y st⋆ − η b x st⋆ Q st y st ≥ ηx st Q s⋆ b y st⋆ − η b x st⋆ Q s⋆ y st − η Γ t ≥ − η Γ t . (by the optimality of b x st⋆ and b y st⋆ )Combining the inequalities and using the deﬁnition of θ st ﬁnish the proof. Appendix C. Proof for Step 2: Lower Bounding k b z st +1 − z st k + k z st − b z st k Lemma 25

For all t ≥ , we have θ st +1 + η Γ t + 2 η ǫ ≥ η (cid:0) ∆ st +1 (cid:1) . Proof

By Eq. (2) and the optimality condition for b x st +1 , we have ( b x st +1 − b x st + ηℓ st ) · ( x ′ s − b x st +1 ) ≥ (23)for any x ′ s ∈ ∆ A . Then by the deﬁnition of ℓ st , ( b x st +1 − b x st + ηQ st y st ) · ( x ′ s − b x st +1 ) ≥ ( b x st +1 − b x st + ηℓ st ) · ( x ′ s − b x st +1 ) − ηε ≥ − ηε (24)where in the last inequality we use Eq. (23). Thus we have for any x ′ s ∈ ∆ A , √ k b x st +1 − x st k + k x st − b x st k ) ≥ √ k b x st +1 − b x st k≥ k b x st +1 − b x st k ≥ ( b x st +1 − b x st ) · ( x ′ s − b x st +1 ) ≥ η ( b x st +1 − x ′ s ) Q st y st − ηε (by Eq. (24)) = η ( x st − x ′ s ) Q st y st + η ( b x st +1 − x st ) Q st y st − ηε ≥ η ( x st − x ′ s ) Q st y st − η k b x st +1 − x st k − γ − ηε. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Using the fact that η − γ ≤ , we get k b x st +1 − x st k + k x st − b x st k + √ ηε ≥ √ (cid:18) η max x ′ ( x st − x ′ s ) Q st y st (cid:19) ≥ η x ′ ( x st − x ′ s ) Q st y st . Similarly, we have k b y st +1 − y st k + k y st − b y st k + √ ηε ≥ η max y ′ x st Q st ( y ′ s − y st ) . Combining themand using k z − z ′ k ≥ k x − x ′ k + k y − y ′ k , we get k b z st +1 − z st k + k z st − b z st k + √ ηε ≥ η (cid:18) max x ′ ( x st − x ′ s ) Q st y st + max y ′ x st Q st ( y ′ s − y st ) (cid:19) ≥ η (cid:16) max y ′ x st Q st y ′ s − min x ′ x ′ s Q st y st (cid:17) = η y ′ (cid:16)b x st +1 Q s⋆ y ′ s + x st ( Q st − Q s⋆ ) y ′ s + ( x st − b x st +1 ) Q s⋆ y ′ s (cid:17) − η x ′ (cid:16) x ′ s Q s⋆ b y st +1 + x ′ s ( Q st − Q s⋆ ) y st + x ′ Q s⋆ ( y st − b y st +1 ) (cid:17) ≥ η x ′ ,y ′ (cid:16)b x st +1 Q s⋆ y ′ s − x ′ s Q s⋆ b y st +1 (cid:17) − η Γ t − η − γ ) (cid:0) k b x st +1 − x st k + k b y st +1 − y st k (cid:1) ( k Q s⋆ k ≤ − γ ) ≥ η st +1 − η Γ t − η − γ ) k b z st +1 − z st k (by the deﬁnition of ∆ st +1 ) ≥ η st +1 − η Γ t − k b z st +1 − z st k . (25)Then notice that we have θ st +1 + η Γ t + 2 η ε ≥ k b z st +1 − z st k + k z st − b z st k + η Γ t η ε (by the deﬁnition of θ st +1 and that η Γ t ≤ η − γ ≤ ) ≥ (cid:18) k b z st +1 − z st k + k z st − b z st k + η Γ t √ ηε (cid:19) (Cauchy-Schwarz inequality) ≥ η (cid:0) ∆ st +1 (cid:1) . (by Eq. (25) and notice that ∆ st +1 ≥ ) Lemma 26 (Key Lemma for Average Duality-gap Bounds)

For all t ≥ , we have dist ⋆ ( b z st +1 ) ≤ dist ⋆ ( b z st ) − θ st +1 + θ st − η

128 (∆ st +1 ) + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Proof

Combining Lemma 25 with Lemma 24, we get dist ⋆ ( b z st +1 ) ≤ dist ⋆ ( b z st ) − θ st +1 − θ st +1 + θ st + 4 η Γ t + 8 η k Q st − Q st − k + 6 ηε ≤ dist ⋆ ( b z st ) − θ st +1 − (cid:18) η (cid:0) ∆ st +1 (cid:1) − η Γ t − η ε (cid:19) + θ st + 4 η Γ t + 8 η k Q st − Q st − k + 6 ηε ≤ dist ⋆ ( b z st ) − θ st +1 + θ st − η (cid:0) ∆ st +1 (cid:1) + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε. ( ηε ≤ ) Lemma 27 (Key Lemma for Point-wise Convergence Bounds)

There exists a constant C ′ > (which depends on the transition and the loss/payoff functions) such that for all t ≥ , dist ⋆ ( b z st +1 ) + 4 . θ st +1 ≤

11 + η C ′ (dist ⋆ ( b z st ) + 4 . θ st ) + 5 η Γ t + 8 η k Q st − Q st − k − θ st + 7 ηε. Proof

By Theorem 5 of (Wei et al., 2021) or Lemma 3 of (Gilpin et al., 2012), we have ∆ st +1 ≥ C dist ⋆ ( b z st +1 ) for some problem-dependent constant < C ≤ − γ ( C depends on { Q s⋆ } s ). Thus Theorem 26implies dist ⋆ ( b z st +1 ) + 5 θ st +1 ≤ dist ⋆ ( b z st ) + θ st − η C

128 dist ⋆ ( b z st +1 ) + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε. By deﬁning C ′ = C , we further get dist ⋆ ( b z st +1 ) + 51 + η C ′ θ st +1 ≤

11 + η C ′ (cid:0) dist ⋆ ( b z st ) + θ st + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε (cid:1) ≤

11 + η C ′ (dist ⋆ ( b z st ) + θ st ) + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε. Notice that η C ′ ≥ × ≥ . . Thus we further have dist ⋆ ( b z st +1 ) + 4 . θ st +1 ≤

11 + η C ′ (dist ⋆ ( b z st ) + 4 . θ st ) + 5 η Γ t + 8 η k Q st − Q st − k − θ st + 7 ηε where in the last inequality we use η C ′ ≤ . η C ′ − because η C ′ ≤ × . Appendix D. Proof for Step 3: Bounding k Q st − Q st − k Lemma 28

We have for t ≥ and all s ∈ S , k Q st − Q st − k ≤ γ (1 − γ ) J t − + 2 γ γ K t − + 16 γ ε − γ . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Proof

It is equivalent to prove that for all t ≥ , k Q st +1 − Q st k ≤ γ (1 − γ ) J t + 2 γ γ K t + 16 γ ε − γ . By deﬁnition, Q st ( a, b ) = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t − i , we have k Q st +1 − Q st k = max a,b ( Q st +1 ( a, b ) − Q st ( a, b )) ≤ γ max s ′ (cid:16) V s ′ t − V s ′ t − (cid:17) (26)Now it sufﬁces to upper bound (cid:0) V st − V st − (cid:1) for any s . By Corollary 22, we have V st − = P t − τ =1 α τt − ρ sτ .Therefore, V st − V st − = α t (cid:0) ρ st − V st − (cid:1) = α t ρ st − t − X τ =0 α τt − ρ sτ ! = α t t − X τ =0 α τt − ( ρ st − ρ sτ ) ! (because P t − τ =0 α τt − = 1 )In the following calculation, we omit the superscript s for simplicity. By deﬁning diff h , | ρ h − ρ h − | , we have ( V t − V t − ) ≤ ( α t ) t − X τ =0 α τt − ( ρ t − ρ τ ) ! ≤ ( α t ) t − X τ =0 α τt − t X h = τ +1 ( ρ h − ρ h − ) ! ≤ ( α t ) t − X τ =0 α τt − t X h = τ +1 diff h ! = ( α t ) t X h =1 h − X τ =0 α τt − diff h ! ≤ ( α t ) t X h =1 δ h − t − diff h ! . (by Lemma 35)Then we continue: ( V t − V t − ) ≤ ( α t ) t X h =1 δ h − t − diff h ! AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES ≤ ( α t ) t X h =1 δ h − t − ! t X h =1 δ h − t − diff h ! (Cauchy-Schwarz inequality) ≤ t X h =1 α h δ h − t − ! t X h =1 α h δ h − t − diff h ! ( α t ≤ α h for h ≤ t ) ≤ t X τ =1 α τt diff τ (note that α h δ h − t − = α h Q t − τ = h (1 − α τ ) ≤ α h Q tτ = h +1 (1 − α τ ) = α ht ) = t X τ =1 α τt (cid:16) ρ t − x τ Q τ y τ + x τ ( Q τ − Q τ − ) y τ + ( x τ − x τ − ) Q τ − y τ + x τ − Q τ − ( y τ − y τ − ) + x τ − Q τ − y τ − − ρ t − (cid:17) ≤ t X τ =1 α τt (cid:18) ε − γ + 21 + γ k Q τ − Q τ − k + 8(1 − γ ) k x τ − x τ − k + 8(1 − γ ) k y τ − y τ − k + 8 ε − γ (cid:19) , where we use ( a + b + c + d + e ) ≤ − γ a + γ b + − γ c + − γ d + − γ e which is due toCauchy-Schwarz inequality. By Lemma 21 and the deﬁnitions of J st , K st , J t , K t in Deﬁnition 13and Deﬁnition 14, t X τ =1 α τt k Q sτ − Q sτ − k = K st ≤ K t , t X τ =1 α τt k z sτ − z sτ − k ≤ J st ≤ J t . Combining them with the previous upper bound for ( V st − V st − ) , we get ( V st − V st − ) ≤ − γ ) J t + 21 + γ K t + 16 ε − γ for all s . Further combining this with Eq. (26), we get k Q st +1 − Q st k ≤ γ (1 − γ ) J t + 2 γ γ K t + 16 γ ε − γ . Appendix E. Proof for Steps 4 and 5: Bounding k Q st − Q s⋆ k Lemma 29

For all t ≥ , Γ t ≤ γ t − X τ =1 α τt − Γ τ + Reg t − + ε ! . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Proof

We proceed with Q st +1 ( a, b )= σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t i = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt ρ s ′ τ (Corollary 22) ≤ σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt x s ′ τ Q s ′ τ y s ′ τ + ε (by the deﬁnition of ρ s ′ τ and that P tτ =1 α τt = 1 for t ≥ ) ≤ σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt b x s ′ t⋆ Q s ′ τ y s ′ τ + Reg t + ε (by the deﬁnition of Reg t ) ≤ σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt b x s ′ t⋆ Q s ′ ⋆ y s ′ τ + t X τ =1 α τt Γ τ + Reg t + ε (by the deﬁnition of Γ τ ) ≤ σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt b x s ′ t⋆ Q s ′ ⋆ y s ′ ⋆ + γ t X τ =1 α τt Γ τ + Reg t + ε ! (by deﬁnition of y s ′ ⋆ ) = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ ⋆ i + γ t X τ =1 α τt Γ τ + Reg t + ε ! ( P tτ =1 α τt = 1 for t ≥ ) = Q s⋆ ( a, b ) + γ t X τ =1 α τt Γ τ + Reg t + ε ! . Similarly, Q st +1 ( a, b ) ≥ Q s⋆ ( a, b ) − γ t X τ =1 α τt Γ τ + Reg t + ε ! . They jointly imply Γ t +1 ≤ γ t X τ =1 α τt Γ τ + Reg t + ε ! . Lemma 30

For any state s and time t ≥ , Reg t ≤ η Z t + 4 η (1 − γ ) J t + 4 ηK t + 3 ε. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Proof

Summing the ﬁrst bound in Lemma 23 over τ = 1 , . . . , t with weights α τt , and droppingnegative terms −k b x st +1 − x st k − k x st − b x t k , we get t X τ =1 α τt ( x sτ − b x sτ⋆ ) Q sτ y sτ ≤ t X τ =1 α τt η (cid:0) dist ⋆ ( b x sτ ) − dist ⋆ ( b x sτ +1 ) (cid:1) + 4 η (1 − γ ) t X τ =1 α τt k y sτ − y sτ − k + 4 η t X τ =1 α τt k Q sτ − Q sτ − k + 3 ε ≤ α t η dist ⋆ ( b x s ) + t X τ =2 α τt − α τ − t η dist ⋆ ( b x sτ ) + 4 η (1 − γ ) t X τ =1 α τt k y sτ − y sτ − k + 4 η t X τ =1 α τt k Q sτ − Q sτ − k + 3 ε ≤ α t η dist ⋆ ( b x s ) + t X τ =2 α τt − α τ − t η dist ⋆ ( b x sτ ) + 4 η (1 − γ ) J st + 4 ηK st + 3 ε. (27)Observe that by deﬁnition, we have for τ ≥ , α τt − α τ − t = α τt (cid:18) − α τ − (1 − α τ ) α τ (cid:19) = α τt × α τ − α τ − + α τ − α τ α τ ≤ α τ − α τt where in the inequality we use α τ ≤ α τ − . Using this in Eq. (27), we get t X τ =1 α τt ( x sτ − b x sτ ∗ ) Q sτ y sτ ≤ α t η dist ⋆ ( b x s ) + 12 η t X τ =2 α τt α τ − dist ⋆ ( b x sτ ) + 4 η (1 − γ ) J st + 4 ηK st + 3 ε = 12 η t X τ =1 α τt α τ − dist ⋆ ( b x sτ ) + 4 η (1 − γ ) J st + 4 ηK st + 3 ε (recall that α = 1 )Using J sn ≤ J n , K sn ≤ K n , and the deﬁnition of Z t , Reg t ﬁnishes the proof. Appendix F. Combining Lemmas to Show Last-iterate Convergence

In this section, we provide proofs for Theorem 1 and Theorem 2. To achieve so, we ﬁrst proveLemma 31 by combining the results in Appendix D and Appendix E. Then we combine Theorem 26,Lemma 27, and Lemma 31 to prove Theorem 1 and Theorem 2.

Lemma 31

For any s and t ≥ , η Γ t + 8 η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) ≤ max s ′ C η (1 − γ ) t − X τ =1 β τt ( θ s ′ τ + θ s ′ τ +1 ) + C t − X τ =1 β τt α τ − dist ⋆ ( b z s ′ τ ) ! + 80 β t + 80 ηε (1 − γ ) . where C = 1152 × and C = 10 . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Proof

By Lemma 28, for all t ≥ , η k Q st − Q st − k ≤ η γ (1 − γ ) J t − + 2 η γ γ K t − + 16 η ε − γ . (28)By Lemma 29 and Lemma 30, for all t ≥ , η Γ t ≤ γ t − X τ =1 α τt − η Γ τ + 4 η (1 − γ ) J t − + 4 η K t − + 12 Z t − + 4 ηε. (29)Now, multiply Eq. (29) with − γ , and then add it to Eq. (28). Then we get that for t ≥ , η k Q st − Q st − k + 1 − γ η Γ t ≤ γ t − X τ =1 α τt − (cid:18) − γ η Γ τ (cid:19) + (cid:18) γ (1 − γ ) + 14(1 − γ ) (cid:19) η J t − + (cid:18) γ γ + 1 − γ (cid:19) η K t − + (1 − γ ) Z t −

32 + ηε ≤ γ t − X τ =1 α τt − (cid:18) − γ η Γ τ (cid:19) + 9(1 − γ ) η J t − + γη K t − + (1 − γ ) Z t −

32 + ηε (see explanation below) ≤ γ t − X τ =1 α τt − (cid:18) − γ η Γ τ + η max s ′ k Q s ′ τ − Q s ′ τ − k (cid:19) + 9(1 − γ ) η J t − + (1 − γ ) Z t −

32 + ηε , where in the second inequality we use that γ γ + − γ − γ = (1 − γ ) (cid:16) − γ γ (cid:17) ≤ since γ ≥ ,and in the last inequality, we use K st − = P t − τ =1 α τt − k Q sτ − Q sτ − k . Deﬁne the new variable u t = η max s k Q st − Q st − k + 1 − γ η Γ t . Then the above implies that for all t ≥ , u t ≤ γ t − X τ =1 α τt − u τ + 9(1 − γ ) η J t − + (1 − γ ) Z t −

32 + ηε . (30)Observe that Eq. (30) is in the form of Lemma 33 with the following choices: g t = u t , ∀ t ≥ ,h t = ( u t + ηε for t = 1 η (1 − γ ) J t − + (1 − γ )32 Z t − + ηε for t ≥ and get that for t ≥ , u t ≤ η (1 − γ ) t X τ =2 β τt J τ − + 1 − γ t X τ =2 β τt Z τ − + β t u + ηε t X τ =1 β τt ≤ η (1 − γ ) t X τ =2 β τt J τ − + 1 − γ t X τ =2 β τt Z τ − + (1 − γ ) β t + ηε − γ (by Lemma 38) AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES because u t ≤ η (1 − γ ) + − γ ≤ − γ . Further using Lemma 34 on the ﬁrst two terms on the right-handside, and noticing that γ ≤ , we further get that for t ≥ , u t ≤ max s η (1 − γ ) t − X τ =1 β τt (cid:13)(cid:13) z sτ − z sτ − (cid:13)(cid:13) + 1 − γ t − X τ =1 β τt α τ − dist ⋆ ( b z sτ ) + (1 − γ ) β t + ηε − γ ≤ max s η (1 − γ ) t − X τ =1 β τt ( k z sτ − b z sτ k + k b z sτ − z sτ − k )+1 − γ t − X τ =1 β τt α τ − dist ⋆ ( b z sτ ) + (1 − γ ) β t + ηε − γ ≤ max s η (1 − γ ) t − X τ =1 β τt ( θ sτ +1 + θ sτ ) + 1 − γ t − X τ =1 β τt α τ − dist ⋆ ( b z sτ ) + (1 − γ ) β t + ηε − γ . (31)Finally, notice that according to the deﬁnition of u t , we have η Γ t + 8 η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) ≤ − γ u t .Combining Eq. (31), we ﬁnish the proof for case for t ≥ . The case for t = 1 is trivial since η Γ t + 8 η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) ≤ ≤

80 = 80 β . Proof of Theorem 1.

Deﬁne C α ( T ) , P Tt =1 α t and let C β be an upper bound of P ∞ t = τ β τt for any τ . With the choice of α t speciﬁed in the theorem, we have C α ( T ) = 1 + P Tt =1 H +1 H + t = O ( H log T ) = O (cid:16) log T − γ (cid:17) . By Lemma 40, we have C β ≤ − γ + 3 . Deﬁne S = |S| .Combining Lemma 31 and Theorem 26, we get that for t ≥ , η

128 (∆ st +1 ) ≤ dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) − θ st +1 + θ st + max s ′ C η (1 − γ ) t − X τ =1 β τt ( θ s ′ τ + θ s ′ τ +1 ) + C t − X τ =1 β τt α τ − dist ⋆ ( b z s ′ τ ) ! + 80 β t + 87 ηε (1 − γ ) . Summing the above over s ∈ S and t ∈ [ T − , and denoting Θ t = P s θ st , we get η T X t =1 X s (∆ st ) ≤ O ( S ) − T X t =1 t + C Sη (1 − γ ) T X t =1 t − X τ =1 β τt (Θ τ + Θ τ +1 )+ O S T X t =1 t − X τ =1 β τt α τ − + S T X t =1 β t + SηεT (1 − γ ) ! (32)since dist ⋆ ( b z sτ ) = O (1) and Θ = O ( S ) . Notice that the following hold T X t =1 t − X τ =1 β τt (Θ τ + Θ τ +1 ) ≤ T − X τ =1 T X t = τ β τt (Θ τ + Θ τ +1 ) ≤ C β T X τ =1 Θ τ , T X t =1 t − X τ =1 β τt α τ − ≤ T X τ =1 T X t = τ β τt α τ − ≤ C β T X τ =1 α τ − = C α ( T ) C β , AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES and P Tt =1 β t ≤ C β . Combining these three inequalities with Eq. (32), we get T X t =1 X s (∆ st ) = 128 η T X t =1 (cid:18) − C β C Sη (1 − γ ) (cid:19) Θ t + O (cid:18) SC α ( T ) C β η + SεTη (1 − γ ) (cid:19) = O (cid:18) SC α ( T ) C β η + SεTη (1 − γ ) (cid:19) . (by our choice of η , we have C β C Sη (1 − γ ) ≤ C Sη (1 − γ ) ≤ )By Cauchy-Schwarz inequality, we further have T X t =1 X s ∆ st ≤ √ ST T X t =1 X s (∆ st ) ! = O S p C α ( T ) C β Tη + ST √ ǫ √ η (1 − γ ) ! . Finally, by Lemma 32, we get T T X t =1 max s,x ′ ,y ′ (cid:16) V s b x t ,y ′ − V sx ′ , b y t (cid:17) ≤ − γ T T X t =1 max s ∆ st = O S p C α ( T ) C β η (1 − γ ) √ T + S √ ǫ √ η (1 − γ ) ! = O (cid:18) S √ log Tη (1 − γ ) √ T + S √ ǫ √ η (1 − γ ) (cid:19) . Proof of Theorem 2.

Combining Lemma 27 and Lemma 31, we get that for all t ≥ , dist ⋆ ( b z st +1 ) + 4 . θ st +1 ≤

11 + η C ′ (cid:0) dist ⋆ ( b z st ) + 4 . θ st (cid:1) + max s ′ C η (1 − γ ) t − X τ =1 β τt ( θ s ′ τ + θ s ′ τ +1 ) + C t − X τ =1 β τt α τ − dist ⋆ ( b z s ′ τ ) ! + 80 β t − θ st + 87 ηε (1 − γ ) . Summing the above inequality over s ∈ S , and denoting L t = P s dist ⋆ ( b z st ) , Θ t = P s θ st , we getthat for all t , L t +1 + 4 . t +1 ≤

11 + η C ′ ( L t + 4 . t ) + C Sη (1 − γ ) t − X τ =1 β τt (Θ τ + Θ τ +1 )+ C S t − X τ =1 β τt α τ − L τ + 80 Sβ t − t + 87 Sηε (1 − γ ) . (33)The key idea of the following analysis is to use the negative (bonus) term − t to cancelthe positive (penalty) term C Sη (1 − γ ) P tτ =1 β τt Θ τ . Since the time indices do not match, we perform smoothing over time to help. Consider the following weighted sum of L τ + 4 . τ with weights λ τt +1 : t +1 X τ =2 λ τt +1 ( L τ + 4 . τ ) AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES = t X τ =1 λ τ +1 t +1 ( L τ +1 + 4 . τ +1 ) (re-indexing) ≤ t X τ =1 λ τt ( L τ +1 + 4 . τ +1 ) (Lemma 37) ≤

11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + C Sη (1 − γ ) t X τ =1 λ τt τ − X i =1 β iτ (Θ i + Θ i +1 )+ C S t X τ =1 λ τt τ − X i =1 β iτ α i − L i + 80 S t X τ =1 λ τt β τ − t X τ =1 λ τt Θ τ + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤

11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + C Sη (1 − γ ) t − X i =1 t X τ = i λ τt β iτ ! (Θ i + Θ i +1 )+ C S t − X i =1 t X τ = i λ τt β iτ ! α i − L i + 80 S t X τ =1 λ τt β τ − t X τ =1 λ τt Θ τ + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤

11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 3 C Sη (1 − γ ) t − X τ =1 λ τt (Θ τ + Θ τ +1 )+ 3 C S − γ t − X τ =1 λ τt α τ − L τ + 240 S − γ λ t − t X τ =1 λ τt Θ τ + 87 Sηε (1 − γ ) t X τ =1 λ τt (by Lemma 36) ≤

11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 6 C Sη (1 − γ ) t X τ =1 λ τt Θ τ + 3 C S − γ t − X τ =1 λ τt α τ − L τ + 240 S − γ λ t − t X τ =1 λ τt Θ τ + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤

11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 3 C S − γ t − X τ =1 λ τt α τ − L τ + 240 S − γ λ t + 87 Sηε (1 − γ ) t X τ =1 λ τt (by our choice of η , C Sη (1 − γ ) ≤ )where in the second-to-last inequality we use Lemma 41: with the special choice of α t speciﬁed inthe theorem, we have λ τt = α t ≤ λ τ +1 t for τ ≤ t − .Let t = min n τ : C S − γ α τ ≤ η C ′ o . Then we have t +1 X τ =2 λ τt +1 ( L τ + 4 . τ ) ≤ (cid:18)

11 + η C ′ + η C ′ (cid:19) t X τ =1 λ τt ( L τ + 4 . τ ) + 3 C S − γ min { t ,t } X τ =1 λ τt α τ − L τ + 240 S − γ λ t + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤

11 + 0 . η C ′ t X τ =3 λ τt ( L τ + 4 . τ ) + 12 C S − γ min { t ,t } X τ =1 λ τt α τ − + 240 S − γ λ t + 87 Sηε (1 − γ ) t X τ =1 λ τt . ( ηC ′ ≤ − according to Lemma 27, L τ ≤ S · max z,z ′ k z − z ′ k ≤ S ) AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Finally, we add λ t +1 ( L + 4 . ) to both sides, and note that λ t +1 ( L + 4 . ) = α t +1 ( L + 4 . ) ≤ α t ( L + 4 . ) ≤ α t · S = 22 Sλ t , where the ﬁrst and second equality is by Lemma 41. Then we get t +1 X τ =1 λ τt +1 ( L τ + 4 . τ ) ≤

11 + 0 . η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 12 C S − γ min { t ,t } X τ =1 λ τt α τ − + 240 S − γ λ t + 22 Sλ t + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤

11 + 0 . η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 274 C S − γ min { t ,t } X τ =1 λ τt α τ − + 87 Sηε (1 − γ ) t X τ =1 λ τt . Deﬁne Y t , t X τ =1 λ τt ( L τ + 4 . τ ) . Then we can further write that for t ≥ , Y t +1 ≤

11 + 0 . η C ′ Y t + 274 C S t − γ λ min { t ,t } t + 87 Sηε (1 − γ ) t X τ =1 λ τt . (upper bounding α τ − by )Applying Lemma 39 with c = . η C ′ . η C ′ , g t = Y t +1 , h t = C S t − γ λ min { t ,t } t + Sηε (1 − γ ) P tτ =1 λ τt ,we get Y t ≤ Y (1 + 0 . η C ′ ) − t + 20 . η C ′ C S t − γ + 87 Sηε (1 − γ ) sup t ′ ∈ [1 , t ] t ′ X τ =1 λ τt ! (1 + 0 . η C ′ ) − t + 20 . η C ′ C S t − γ sup t ′ ∈ [ t ,t ] λ min { t ,t ′ } t ′ + 87 Sηε (1 − γ ) sup t ′ ∈ [ t ,t ] t ′ X τ =1 λ τt ! . ( λ τt ≤ )With the choice of α t = H +1 H + t where H = − γ , we have t = Θ (cid:18) C S (1 − γ ) η C ′ ( H + 1) (cid:19) = Θ (cid:18) S (1 − γ ) η C ′ (cid:19) sup t ′ ∈ [1 ,t ] t ′ X τ =1 λ τt ≤ sup t ′ ∈ [1 ,t ] t ′ α t ≤ t ( H + 1) H + t ≤ H + 1 = O (cid:18) − γ (cid:19) (Lemma 41) sup t ′ ∈ [ t ,t ] λ min { t ,t ′ } t ′ = ( if t ≤ t α t else (Lemma 41) AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Combining them and noticing that (1 + 0 . η C ′ ) − t = O ( t ) when t ≥ η C ′ , we get that for t ≥ t = Θ (cid:16) S (1 − γ ) η C ′ (cid:17) , Y t = O (cid:18) S η C ′ (1 − γ ) α t + SεηC ′ (1 − γ ) (cid:19) = O (cid:18) S η C ′ (1 − γ ) t + SεηC ′ (1 − γ ) (cid:19) . Since Y t ≤ S + 22 S ( t − α t ≤ O ( S log t − γ ) , the above bound also trivially holds for t ≤ t .Then noticing that L t ≤ Y t ﬁnishes the proof. Lemma 32

For any policy pair x, y , the duality gap on the game can be related to the duality gapon individual states as follows: max s,x ′ ,y ′ (cid:0) V sx,y ′ − V sx ′ ,y (cid:1) ≤ − γ max s,x ′ ,y ′ (cid:0) x s Q s⋆ y ′ s − x ′ s Q s⋆ y s (cid:1) . Proof

Notice that for any policy x and state s , max y ′ V sx,y ′ − V s⋆ = X a,b x s ( a ) y ′ s ( b ) Q sx,y ′ ( a, b ) − X a,b x s⋆ ( a ) y s⋆ ( b ) Q s⋆ ( a, b )= X a,b x s ( a ) y ′ s ( b ) (cid:0) Q sx,y ′ ( a, b ) − Q s⋆ ( a, b ) (cid:1) + X a,b (cid:0) x s ( a ) y ′ s ( b ) − x s⋆ ( a ) y s⋆ ( b ) (cid:1) Q s⋆ ( a, b )= γ X a,b x s ( a ) y ′ s ( b ) p ( s ′ | s, a, b ) (cid:16) V s ′ x,y ′ − V s ′ ⋆ (cid:17) + x s Q s⋆ y ′ s − x s⋆ Q s⋆ y s⋆ ≤ γ max s ′ ,y ′ (cid:16) V s ′ x,y ′ − V s ′ ⋆ (cid:17) + x s Q s⋆ y ′ s − x s⋆ Q s⋆ y s⋆ . Taking max over s on two sides and rearranging, we get max s,y ′ (cid:0) V sx,y ′ − V s⋆ (cid:1) ≤ − γ max s,y ′ (cid:0) x s Q s⋆ y ′ s − x s⋆ Q s⋆ y s⋆ (cid:1) ≤ − γ max s,x ′ ,y ′ (cid:0) x s Q s⋆ y ′ s − x ′ s Q s⋆ y s (cid:1) . Similarly, max s,x ′ (cid:0) V s⋆ − V sx ′ ,y (cid:1) ≤ − γ max s,x ′ ,y ′ (cid:0) x s Q s⋆ y ′ s − x ′ s Q s⋆ y s (cid:1) . Combining the two inequalities, we get max s,x ′ ,y ′ (cid:0) V sx,y ′ − V sx ′ ,y (cid:1) ≤ − γ max s,x ′ ,y ′ (cid:0) x s Q s⋆ y ′ s − x ′ s Q s⋆ y s (cid:1) . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Appendix G. Auxiliary Lemmas

G.1. Interactions between the Auxiliary CoefﬁcientsLemma 33

Let { g t } t =1 , ,... , { h t } t =1 , ,... be non-negative sequences that satisfy g t ≤ γ P t − τ =1 α τt − g τ + h t for all t ≥ . Then g t ≤ P tτ =1 β τt h τ . Proof

We prove it by induction. When t = 1 , the condition guarantees g ≤ h = β h . Supposethat it holds for , . . . , t − . Then g t ≤ γ t − X τ =1 α τt − g τ + h t ≤ γ t − X τ =1 α τt − τ X i =1 β iτ h i ! + h t = t − X i =1 t − X τ = i γα τt − β iτ ! h i + h t It remains to prove that P t − τ = i γα τt − β iτ ≤ β it for all i ≤ t − . We use another induction to showthis. Fix i and t , and deﬁne the partial sum ζ r = P rτ = i γα τt − β iτ for r ∈ [ i, t − . Below we showthat ζ r ≤ α i r Y τ = i (1 − α τ + α τ γ ) t − Y τ = r +1 (1 − α τ ) . (34)Notice that the right-hand side above is β it when r = t − , which is exactly what we want to prove.When r = i , ζ r = γα it − = γα i Q t − τ = i +1 (1 − α τ ) ≤ α i (1 − α i + α i γ ) Q t − τ = i +1 (1 − α τ ) wherethe inequality is because − α i + α i γ − γ = (1 − α i )(1 − γ ) ≥ . Now assume that Eq. (34) holdsup to r for some r ≥ i . Then ζ r +1 = ζ r + β ir +1 γα r +1 t − ≤ α i r Y τ = i (1 − α τ + α τ γ ) t − Y τ = r +1 (1 − α τ ) + α i r Y τ = i (1 − α τ + α τ γ ) ! γα r +1 t − Y τ = r +2 (1 − α τ )= α i r +1 Y τ = i (1 − α τ + α τ γ ) t − Y τ = r +2 (1 − α τ ) . This ﬁnishes the induction.

Lemma 34

Let { h t } t =1 , ,... and { k t } t =1 , ,... be non-negative sequences that satisfy h t = P tτ =1 α τt k τ .Then P tτ =2 β τt h τ − ≤ γ P t − τ =1 β τt k τ . Proof

By the assumption on h τ , we have t X τ =2 β τt h τ − ≤ t X τ =2 β τt τ − X i =1 α iτ − k i ! = t − X i =1 t X τ = i +1 β τt α iτ − ! k i AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

It remains to prove that for i < t , P tτ = i +1 β τt α iτ − ≤ γ β it , or equivalently, γ P tτ = i +1 β τt α iτ − ≤ γ β it . Below we use another induction to prove this. Fix i and t , and deﬁne the partial sum ζ r = γ P tτ = r β τt α iτ − for r ∈ [ i + 1 , t ] . We will show that ζ r ≤ α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r (1 − α τ + α τ γ ) . (35)For the base case r = t , we have ζ r = γβ tt α it − = γα i t − Y τ = i +1 (1 − α τ ) ≤ α i t − Y τ = i +1 (1 − α τ ) . Suppose that Eq. (35) holds up to r for some r ≤ t . Then ζ r − = ζ r + α ir − γβ r − t ≤ α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r (1 − α τ + α τ γ ) + α i r − Y τ = i +1 (1 − α τ ) ! γα r − t − Y τ = r − (1 − α τ + α τ γ ) ≤ α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r (1 − α τ + α τ γ ) ! × (1 − α r − + γα r − (1 − α r − + α r − γ )) ≤ α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r (1 − α τ + α τ γ ) ! × (1 − α r − + α r − γ )= α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r − (1 − α τ + α τ γ ) . This ﬁnishes the induction. Applying the result with r = i + 1 , we get ζ i +1 ≤ α i t − Y τ = i +1 (1 − α τ + α τ γ ) = β it − α i + α i γ ≤ β it γ where the last inequality is by − α i + α i γ − γ = (1 − α i )(1 − γ ) ≥ . This ﬁnishes the proof. Lemma 35

For ≤ h ≤ t , P hτ =0 α τt = δ ht . Proof

We prove it by induction on h . When h = 0 , P hτ =0 α τt = α t = Q tτ =1 (1 − α τ ) = δ ht since α = 1 . Suppose that the formula holds for h . Then P h +1 τ =0 α τt = P hτ =0 α τt + α h +1 t = Q tτ = h +1 (1 − α τ )+ α h +1 Q tτ = h +2 (1 − α τ ) = Q tτ = h +2 (1 − α τ ) = δ h +1 t , which ﬁnishes the induction. Lemma 36

For any positive integers i, t with i ≤ t , P tτ = i λ τt β iτ ≤ − γ λ it . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Proof

Notice that t X τ = i λ τt β iτ = λ it + t X τ = i +1 λ τt β iτ . (36)Below we use induction to prove t X τ = r λ τt β iτ ≤ − γ α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ for r ∈ [ i + 1 , t ] . When r = t , P tτ = r λ τt β iτ = λ tt β it = β it = α i Q t − τ = i (1 − α τ (1 − γ )) . Suppose thisholds for some r ≤ t . Then t X τ = r − λ τt β iτ ≤ − γ α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ + β ir − λ r − t ≤ − γ α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ + α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17)! α r − t − Y τ = r − λ τ ≤ " α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ − γ (1 − α r − (1 − γ )) + α r − (cid:19) ( λ r − ≤ ) = " α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ − γ (cid:18) − α r − (1 − γ ) (cid:19)(cid:19) ≤ − γ α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r − λ τ , ( λ r − ≥ − α r − (1 − γ ) by the deﬁnition of λ r − )which ﬁnishes the induction. Notice that this implies t X τ = i +1 λ τt β iτ ≤ − γ α i (1 − α i (1 − γ )) t − Y τ = i +1 λ τ ≤ − γ α i t − Y τ = i λ τ = 21 − γ λ it where the second inequality is by the deﬁnition of λ i . Thus, t X τ = i λ τt β iτ ≤ − γ λ it . (37) Lemma 37 λ τ +1 t +1 ≤ λ τt . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Proof

When τ < t , we have λ τ +1 t +1 λ τt = α τ +1 Π ti = τ +1 λ i α τ Π t − i = τ λ i ≤ α τ +1 α τ λ τ ≤ where in the ﬁrst inequality we use λ t ≤ and in the second inequality we use the deﬁnition of λ τ .When τ = t , we have λ τ +1 t +1 λ τt = = 1 . Lemma 38 P tτ =1 β τt ≤ − γ . Proof

Below we use induction to prove that for all r = 1 , , . . . , t − , r X τ =1 β τt ≤ − γ t − Y i = r (1 − α i + α i γ ) . When r = 1 , the left-hand side is β t = α Q t − i =1 (1 − α i + α i γ ) ≤ − γ Q t − i =1 (1 − α i + α i γ ) , whichis the right-hand side.Suppose that this holds for r , then r +1 X τ =1 β τt = β r +1 t + r X τ =1 β τt ≤ α r +1 t − Y i = r +1 (1 − α i + α i γ ) + 11 − γ t − Y i = r (1 − α i + α i γ ) ≤ − γ t − Y i = r +1 (1 − α i + α i γ ) ! ( α r +1 (1 − γ ) + 1 − α r (1 − γ )) ≤ − γ t − Y i = r +1 (1 − α i + α i γ ) (because α r +1 ≤ α r )which ﬁnishes the induction.Therefore, P tτ =1 β τt = 1 + P t − τ =1 β τt ≤ − γ ≤ − γ . Lemma 39

Let { g t } t =0 , , ,... , { h t } t =1 , ,... be non-negative sequences that satisfy g t ≤ (1 − c ) g t − + h t for some c ∈ (0 , for all t ≥ . Then g t ≤ g (1 − c ) t + max τ ∈ [1 , t / ] h τ c (1 − c ) t + max τ ∈ [ t / ,t ] h τ c . Proof

We ﬁrst show that g t ≤ g (1 − c ) t + t X τ =1 (1 − c ) t − τ h τ . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

The case of t = 1 is clear. Suppose that this holds for g t . Then g t +1 ≤ (1 − c ) g (1 − c ) t + t X τ =1 (1 − c ) t − τ h τ ! + h t +1 = g (1 − c ) t +1 + t +1 X τ =1 (1 − c ) t +1 − τ h τ , which ﬁnishes the induction. Therefore, g t ≤ g (1 − c ) t + t / X τ =1 (1 − c ) t − τ max τ ∈ [1 , t / ] h τ + t X τ = t / +1 (1 − c ) t − τ max τ ∈ [ t / ,t ] h τ ≤ g (1 − c ) t + max τ ∈ [1 , t / ] h τ c (1 − c ) t + max τ ∈ [ t / ,t ] h τ c . G.2. Some Properties for the choice of α t = H +1 H + t Lemma 40

For the choice α t = H +1 H + t with H ≥ − γ , we have P ∞ t = τ β τt ≤ H + 3 . Proof

When t ≥ τ + 2 , β τt = α τ t − Y i = τ (1 − α i (1 − γ )) ≤ α τ t − Y i = τ (cid:18) − α i × H + 1 (cid:19) ( H + 1 ≥ − γ ) = α τ t − Y i = τ (cid:18) − H + i (cid:19) = H + 1 H + τ × H + τ − H + τ × H + τ − H + τ + 1 × · · · × H + t − H + t − H + 1 H + τ × ( H + τ − H + τ − H + t − H + t − H + 1 H + τ ( H + τ − H + τ − (cid:18) H + t − − H + t − (cid:19) ≤ ( H + 1)( H + τ − (cid:18) H + t − − H + t − (cid:19) . Therefore, ∞ X t = τ +2 β τt ≤ ( H + 1)( H + τ − × H + τ ≤ H + 1 , and thus P ∞ t = τ β τt ≤ H + 3 . AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Lemma 41

For the choice α t = H +1 H + t with H ≥ − γ , we have λ τt = α t for τ < t . Proof

With this choice of α t , λ t = max (cid:26) H + tH + t + 1 , − − γ × H + 1 H + t (cid:27) = max (cid:26) − H + t + 1 , − − γ × H + 1 H + t (cid:27) . By the condition, we have − γ × H +1 H + t ≥ H + t ≥ H + t +1 . Therefore, λ t = H + tH + t +1 = α t +1 α t . Thusfor τ < t , λ τt = α τ t − Y i = τ α i +1 α i = α t . Appendix H. Analysis on Sample Complexity

H.1. Proof of Theorem 3Proof

As long as we can make (cid:12)(cid:12)(cid:12) ℓ st ( a ) − e ⊤ a Q st e y st (cid:12)(cid:12)(cid:12) ≤ ε |A| (38)hold with high probability, then (cid:12)(cid:12) ℓ st ( a ) − e ⊤ a Q st y st (cid:12)(cid:12) ≤ (cid:12)(cid:12) ℓ st ( a ) − e ⊤ a Q st e y st (cid:12)(cid:12) + (cid:12)(cid:12) e ⊤ a Q st e y st − e ⊤ a Q st y st (cid:12)(cid:12) ≤ ε |A| + − γ × ε ′ |A| ≤ ε |A| , which implies k ℓ st − Q st y st k ≤ ε . We can similarly ensure k r st − Q s ⊤ t x st k ≤ ε and | ρ st − x s ⊤ t Q st y st | ≤ ε by the same way. Let N s,a , P Li =1 [ s i = s, a i = a ] be the number oftimes the ( s, a ) pair is visited. For a deterministic N , we can use Azuma-Hoeffding inequalityand know that Eq. (38) holds with probability − δ if N = Ω (cid:16) |A| ε log(1 /δ ) (cid:17) . However, N s,a israndom, so we cannot use Azuma-Hoeffding’s inequality directly. Let ( b (1) , s (1) ) , ( b (2) , s (2) ) , . . . bea sequence of independent random variables where b ( i ) ∼ e y st , s ( i ) ∼ p ( ·| s, a, b ( i ) ) , i = 1 , , . . . anddeﬁne e ℓ st,m = m P mi =1 ( σ ( s, a, b ( i ) ) + γV s ( i ) t − ) . It is direct to see that e ℓ st,m is an unbiased estimatorof e ⊤ a Q st e y st . Then by Azuma-Hoeffding’s inequality, we haveProb "(cid:12)(cid:12)(cid:12) ℓ st ( a ) − e ⊤ a Q st e y st (cid:12)(cid:12)(cid:12) ≤ O s log( L/δ ) N s,a ! ≤ Prob " ∃ m ∈ [ L ] , (cid:12)(cid:12)(cid:12)e ℓ st,m − e ⊤ a Q st e y st (cid:12)(cid:12)(cid:12) ≤ O r log( L/δ ) m ! ≤ L X m =1 Prob "(cid:12)(cid:12)(cid:12)e ℓ st,m − e ⊤ a Q st e y st (cid:12)(cid:12)(cid:12) ≤ O r log( L/δ ) m ! ≤ δ. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Therefore, with probability at least − δ , Eq. (38) holds if N s,a = Ω (cid:18) |A| ε log( L/δ ) (cid:19) . (39)Now it remains to determine L to make Eq. (39) hold with high probability. Note that by Assumption 1,we know that T s ′ → s e x t , e y t ≤ µ for any s ′ . Let T s,a be the distribution of random variable which is thenumber of rounds between the current state-action pair ( s ′ , a ′ ) and the next occurrence of ( s, a ) under strategy e x st and e y st . The mean of this distribution is t s,a ≤ |A| ε ′ µ ≤ |A| ε ′ µ . Then by Markovinequality, Prob (cid:20) the number of rounds before reaching ( s, a ) ≤ |A| ε ′ µ (cid:21) ≥ . Therefore, with probability at least − δL , within Θ( | A | ε ′ µ log( L/δ )) rounds, we reach ( s, a ) state-action at least pair once. Thus, Eq. (39) holds when L = Ω (cid:16) |A| ε ′ µε log ( L/δ ) (cid:17) with probability − δ . Solving L gives L = e Ω (cid:16) |A| (1 − γ ) µε log (1 /δ ) (cid:17) . The cases for r st ( b ) and ρ st are similar.Finally, using a union bound on all A , B , S , and all iterations, we know that with probability − δ , the ε -approximations are always guaranteed if we use the estimation above and take L = e Ω (cid:16) |A| + |B| (1 − γ ) µε log ( T /δ ) (cid:17) . H.2. Proof of Corollary 4Proof

From Theorem 1, we know that in order to show T P Tt =1 max s,x ′ ,y ′ (cid:16) V s b x t ,y ′ − V sx ′ , b y t (cid:17) ≤ ξ ,it is sufﬁcient to show |S| η (1 − γ ) q log TT ≤ ξ and |S|√ ε √ η (1 − γ ) ≤ ξ . Solving these two inequalities, weget T = Ω (cid:16) |S| η (1 − γ ) ξ log |S| η (1 − γ ) ξ (cid:17) and ε = O (cid:16) η (1 − γ ) ξ |S| (cid:17) . Plugging ε into L in Theorem 3 givesthe required L . H.3. Proof of Corollary 5Proof

From Theorem 2, we know that in order to show |S| P s ∈S dist ⋆ ( b z sT ) ≤ ξ , it is sufﬁ-cient to show |S| η C (1 − γ ) T ≤ ξ and εηC (1 − γ ) ≤ ξ . Solving these two inequalities, we get T = Ω (cid:16) |S| η C (1 − γ ) ξ (cid:17) and ε = O (cid:0) ηC (1 − γ ) ξ (cid:1) . Plugging ε into L in Theorem 3 gives therequired L . Appendix I. Analysis on Rationalily

In this section, we analyze the rationality of our algorithm. First, we present the full pseudocodeof Algorithm 2, which is the single-player-perspective version of Algorithm 1, and then prove thatAlgorithm 2 achieves rationality. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Algorithm 2

Single-Player-Perspective Optimistic Gradient Descent/Ascent for Markov Games

Parameter : γ ∈ [ , , η ≤ q (1 − γ ) S , ε ∈ h , − γ i Parameters : a non-increasing sequence { α t } Tt =1 that goes to zero. Initialization : arbitrarily initialize b x s = x s ∈ ∆ A for all s ∈ S . V s ← for all s ∈ S . for t = 1 , . . . , T do For all s , deﬁne Q st ∈ R |A|×|B| as Q st ( a, b ) , σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t − i , and update b x st +1 = Π ∆ A nb x st − ηℓ st o , (40) x st +1 = Π ∆ A nb x st +1 − ηℓ st o , (41) V st = (1 − α t ) V st − + α t ρ st , (42)where ℓ st and ρ st are ε -approximations of Q st y s and x s ⊤ t Q st y s , respectively. endI.1. Single-Player-Perspective Version of Algorithm 1I.2. Analysis of Algorithm 2 In this section, we prove Theorem 6, which shows the rationality of Algorithm 2. We call theoriginal game

Game 1 and construct a two-player Markov game

Game 2 with the difference be-ing that Player 2 has only one single action (call it ) on each state, the loss function is rede-ﬁned as σ ( s, a,

1) = E b ∼ y s [ σ ( s, a, b )] , and the transition kernel is redeﬁned as p ( s ′ | s, a,

1) = E b ∼ y s [ p ( s ′ | s, a, b )] . Correspondingly, we deﬁne Q st ( a,

1) = σ ( s, a,

1) + γ E s ′ ∼ p ( ·| s,a, h V s ′ t − i , b x st +1 = Π ∆ A { b x st − ηℓ st } ,x st +1 = Π ∆ A (cid:8)b x st +1 − ηℓ st (cid:9) ,V st = (1 − α t ) V st − + α t ρ st , where V s = 0 for all s , and ℓ st and ρ st are the same as in Algorithm 2. Clearly, the sequences { b x t , x t } t ∈ [ T ] and { b x t , x t } t ∈ [ T ] are exactly the same (assuming their initializations are the same,that is, b x = b x and x = x ). In the following lemma, we show that ℓ st and ρ st are indeed ε -approximation of Q st ( · , and x s ⊤ t Q st ( a, · ) , which then implies that the sequence { b x t } t ∈ [ T ] isindeed the output of our Algorithm 1 for Game 2 (note that we can think of Player 2 executingAlgorithm 1 in

Game 2 as well since she only has one unique strategy). Realizing that X s⋆ for Game 2 is exactly the set of best responses of y s , we can thus conclude that Theorem 6 is a directcorollary of Theorem 1 and Theorem 2. Lemma 42

For all t and s , ℓ st and ρ st are ε -approximation of Q st ( · , and x s ⊤ t Q st ( a, · ) respectively. AST - ITERATE C ONVERGENCE OF

OGDA IN I NFINITE - HORIZON M ARKOV G AMES

Proof

We prove the result together with V st = V st , Q st ( · ,

1) = Q st y s for all t ∈ [ T ] by induction.When t = 1 , V st = V st clearly holds. In addition, Q s ( a, · ) y s = E b ∼ y s [ σ ( s, a, b )] = Q s ( a, .Therefore ℓ s and ρ s are indeed ε -approximation of Q s ( a, · ) and x s ⊤ Q st ( a, · ) .Suppose that the claim holds at t . By deﬁnition and the inductive assumption, we have Q st +1 ( a,

1) = σ ( s, a,

1) + γ E s ′ ∼ p ( ·| s,a, h V s ′ t i = E b ∼ y s h σ ( s, a, b ) + γ E s ′ ∼ p ( s ′ | s,a,b ) h V s ′ t ii = Q st +1 ( a, · ) y s , which also shows that ℓ st +1 and ρ st +1 are indeed ε -approximation of Q st +1 ( · , and x s ⊤ t +1 Q st ( a, · ) (recall x st +1 = x st +1 ). By deﬁnition of V st +1 , we also have V st +1 = (1 − α t +1 ) V st + α t ρ st =(1 − α t ) V st + α t ρ st = V st +1 , which ﬁnishes the induction., which ﬁnishes the induction.