Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games
aa r X i v : . [ c s . L G ] F e b Last-iterate Convergence of Decentralized Optimistic GradientDescent/Ascent in Infinite-horizon Competitive Markov Games
Chen-Yu Wei
CHENYU . WEI @ USC . EDU
Chung-Wei Lee * LEECHUNG @ USC . EDU
Mengxiao Zhang * MENGXIAO . ZHANG @ USC . EDU
Haipeng Luo
HAIPENGL @ USC . EDU
University of Southern California
Abstract
We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentral-ized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithmis based on running an Optimistic Gradient Descent Ascent algorithm on each state to learn thepolicies, with a critic that slowly learns the value of each state. To the best of our knowledge, thisis the first algorithm in this setting that is simultaneously rational (converging to the opponent’sbest response when it uses a stationary policy), convergent (converging to the set of Nash equilibriaunder self-play), agnostic (no need to know the actions played by the opponent), symmetric (play-ers taking symmetric roles in the algorithm), and enjoying a finite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.
1. Introduction
Multi-agent reinforcement learning studies how multiple agents should interact with each other andthe environment, and has wide applications in, for example, playing board games (Silver et al.,2017) and real-time strategy games (Vinyals et al., 2019). To model these problems, the frameworkof Markov games (also called stochastic games) (Shapley, 1953) is often used, which can be seenas a generalization of Markov Decision Processes (MDPs) from a single agent to multiple agents.In this work, we focus on one fundamental class: two-player zero-sum Markov games.In this setting, there are many centralized algorithms developed in a line of recent works withnear-optimal sample complexity for finding a Nash equilibrium (Wei et al., 2017; Sidford et al.,2020; Xie et al., 2020; Bai and Jin, 2020; Zhang et al., 2020; Liu et al., 2020b). These algorithmsrequire a central controller that collects some global knowledge (such as the actions and the rewardsof all players) and then jointly decides the policies for all players. Centralized algorithms are usually convergent (as defined in (Bowling and Veloso, 2001)), in the sense that the policies of the playersconverge to the set of Nash equilibria.On the other hand, there is also a surge of studies on decentralized algorithms that run inde-pendently on each player, requiring only local information such as the player’s own action and thecorresponding reward feedback (Zhang et al., 2019b; Bai et al., 2020; Tian et al., 2020; Liu et al.,2020a; Daskalakis et al., 2020). Compared to centralized ones, decentralized algorithms are usuallymore versatile and can potentially run in different environments (cooperative or competitive). Manyof them enjoy the property of being rational (as defined in (Bowling and Veloso, 2001)), in the * Equal contribution. © C.-Y. Wei, C.-W. Lee, M. Zhang & H. Luo.
AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES sense that a player’s policy converges to the best response to the opponent no matter what stationarypolicy the opponent uses. However, it is also often more challenging to show the convergence to aNash equilibrium when the two players execute the same decentralized algorithm.Our main contribution is to develop the first decentralized algorithm that is simultaneously rational , last-iterate convergent (with a concrete finite-time guarantee), agnostic , and symmet-ric (more details to follow in Section 1.1). Our algorithm is based on Optimistic Gradient De-scent/Ascent (OGDA) (Chiang et al., 2012; Rakhlin and Sridharan, 2013) and importantly relies ona critic that slowly learns a certain value function for each state. Following previous works onlearning MDPs (Abbasi-Yadkori et al., 2019; Agarwal et al., 2019) or Markov games (Perolat et al.,2018), we present the convergence guarantee in terms of the number of iterations of the algorithmand the estimation error of some gradient information (along with other problem-dependent con-stants), where the estimation error can be zero in a full-information setting, or goes down to zerofast enough with additional structural assumptions (e.g. every stationary policy pair induces anirreducible Markov chain, similar to (Auer and Ortner, 2007)).While the OGDA algorithm, first studied in (Popov, 1980) under a different name, has beenextensively used in recent years for learning matrix games (a special case of Markov games withone state), to the best of our knowledge, no previous work has applied it to learning Markov gamesand derived a concrete last-iterate convergence rate. Several recent works derive last-iterate con-vergence of OGDA for matrix games (Hsieh et al., 2019; Liang and Stokes, 2019; Mokhtari et al.,2019; Golowich et al., 2020; Wei et al., 2021), and our analysis is heavily inspired by the approachof (Wei et al., 2021). However, the extension to Markov games is highly non-trivial as there isadditional “instability penalty” in the system that we need to handle; see Section 4 for detaileddiscussions. In this section, we discuss and compare related works on learning two-player zero-sum Markovgames. We refer the readers to a thorough survey by (Zhang et al., 2019a) for other topics in multi-agent reinforcement learning.Shapley (1953) first introduces the Markov game model and proposes an algorithm analogous tovalue iteration for solving two-player zero-sum Markov games (with all parameters known). Later,Hoffman and Karp (1966) propose a policy iteration algorithm, and Pollatschek and Avi-Itzhak (1969)propose another policy iteration variant that works better in practice but cannot always converge.With the efforts of Van Der Wal (1978) and Filar and Tolwinski (1991), a slight variant of the(Pollatschek and Avi-Itzhak, 1969) algorithm is proposed in (Filar and Tolwinski, 1991) and provento converge. In such a full-information setting where all parameters are know, our algorithm has noestimation error and can also be viewed as a new policy-iteration algorithm.Littman (1994) initiates the study of competitive reinforcement learning under the frameworkof Markov games and proposes an extension of the single-player Q-learning algorithm, calledminimax-Q, which is later proven to converge under some conditions (Szepesv´ari and Littman,1999). While minimax-Q can run in a decentralized manner, it is conservative and only convergesto the minimax policy but not the best response to the opponent.To fix this issue, the work of Bowling and Veloso (2001) argues that a desirable multi-agentlearning algorithm should have the following two properties simultaneously: rational and conver-gent . By their definition, a rational algorithm converges to its opponent’s best response if the oppo- AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES nent converges to a stationary policy, while a convergent algorithm converges to a Nash equilibriumif both agents use it. They propose the WoLF (Win-or-Learn-Fast) algorithm to achieve this goal,albeit only with empirical evidence. Subsequently, Conitzer and Sandholm (2007); Perolat et al.(2018); Sayin et al. (2020) design decentralized algorithms that provably enjoy these two proper-ties, but only with asymptotic guarantees.Recently, there is a surge of works that provide finite-time guarantees and characterize the tightsample complexity for finding Nash equilibria (Perolat et al., 2015; P´erolat et al., 2016; Wei et al.,2017; Sidford et al., 2020; Xie et al., 2020; Zhang et al., 2020; Bai and Jin, 2020; Liu et al., 2020b).These algorithms are all essentially centralized. Below, we focus on comparisons with several recentworks that propose decentralized algorithms and provide finite-time guarantees. Comparison with R-Max (Brafman and Tennenholtz, 2002), UCSG-online (Wei et al., 2017)and OMNI-VI-online (Xie et al., 2020)
These algorithms, like minimax-Q, converge to the min-imax policy instead of the best response to the opponent, even when the opponent is weak (i.e., notusing its best policy). In other words, these algorithms are not rational. Another drawback of thesealgorithms is that the learner has to observe the actions taken by the opponent. Our algorithm, onthe other hand, is both rational and agnostic to what the opponent plays.
Comparison with Optimistic Nash V-Learning (Bai et al., 2020; Tian et al., 2020)
The NashV-Learning algorithm handles the finite-horizon tabular case. It runs an exponential-weight algo-rithm on each state, with importance-weighted loss/reward estimators. Since the exponential-weightalgorithm is known to diverge even in matrix games (Bailey and Piliouras, 2018), the iterate of NashV-Learning also diverges. After training, however, Nash V-Learning can output a near-optimal non-Markovian policy with size linear in the training time. In contrast, our algorithm exhibits last-iterateconvergence, and the output is a simple Markovian policy.
Comparison with Smooth-FSP (Liu et al., 2020a)
The Smooth-FSP algorithm handles the func-tion approximation setting. The objective function it optimizes is the original objective plus an en-tropy regularization term. Because of this additional regularization, the players are only guaranteedto converge to some neighborhood of the minimax policy pair (with a constant radius), even whentheir gradient estimation error is zero. In contrast, our algorithm converges to the true minimaxpolicy pair when the gradient estimation error goes to zero.
Comparison with Independent PG (Daskalakis et al., 2020)
This work studies independentpolicy gradient in the tabular case. To achieve last-iterate convergence, the two players have touse asymmetric learning rates, and only the one with a smaller learning rate converges to the min-imax policy. In contrast, the two players of our algorithm are completely symmetric, and theysimultaneously converge to the equilibrium set.
2. Preliminaries
We consider a two-player zero-sum discounted Markov game defined by a tuple ( S , A , B , σ, p, γ ) ,where: 1) S is a finite state space; 2) A and B are finite action spaces for Player 1 and Player2 respectively; 3) σ is the loss (payoff) function for Player 1 (Player 2), with σ ( s, a, b ) ∈ [0 , specifying how much Player 1 pays to Player 2 if they are at state s and select actions a and b
1. It is tempting to consider an even stronger rationality notion, that is, having no regret against an arbitrary opponent.This is, however, known to be computationally hard (Radanovic et al., 2019; Bai et al., 2020). AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES respectively; 4) p : S×A×B → ∆ S is the transition function, with p ( s ′ | s, a, b ) being the probabilityof transitioning to state s ′ after actions a and b are taken by the two players respectively at state s ( ∆ S denotes the set of probability distributions over S ); 5) and ≤ γ < is a discount factor. A stationary policy of Player 1 can be described by a function
S → ∆ A that maps each state toan action distribution. We use x s ∈ ∆ A to denote the action distribution for Player 1 on state s , anduse x = { x s } s ∈S to denote the complete policy. We define y s and y = { y s } s ∈S similarly for Player2. For notational convenience, we further define z s = ( x s , y s ) ∈ ∆ A × ∆ B as the concatenatedpolicy of the players on state s , and let z = { z s } s ∈S .For a pair of stationary policies ( x, y ) and an initial state s , the expected discounted value thatthe players pay/gain can be represented as V sx,y = E " ∞ X t =1 γ t − σ ( s t , a t , b t ) (cid:12)(cid:12)(cid:12)(cid:12) s = s, a t ∼ x s t , b t ∼ y s t , s t +1 ∼ p ( ·| s t , a t , b t ) , ∀ t ≥ . The minimax game value on state s is then defined as V s⋆ = min x max y V sx,y = max y min x V sx,y . It is known that a pair of stationary policies ( x ⋆ , y ⋆ ) attaining the minimax value on state s isnecessarily attaining the minimax value on all states (Filar and Vrieze, 2012), and we call such x ⋆ a minimax policy, such y ⋆ a maximin policy, and such pair a Nash equilibrium. Further define X s⋆ = { x s⋆ ∈ x ⋆ : x ⋆ is a minimax policy } and similarly Y s⋆ = { y s⋆ ∈ y ⋆ : y ⋆ is a maximin policy } ,and denote Z s⋆ = X s⋆ × Y s⋆ . It is also known that any x = { x s } s ∈S with x s ∈ X s⋆ for all s is aminimax policy (similarly for y ) (Filar and Vrieze, 2012).For any x s , we denote its distance from X s⋆ as dist ⋆ ( x s ) = min x s⋆ ∈X s⋆ k x s⋆ − x s k , where k v k fora vector v denotes its L norm throughout the paper; similarly, dist ⋆ ( y s ) = min y s⋆ ∈Y s⋆ k y s⋆ − y s k and dist ⋆ ( z s ) = min z s⋆ ∈Z s⋆ k z s⋆ − z s k = q dist ⋆ ( x s ) + dist ⋆ ( y s ) . The projection operator for aconvex set U is defined as Π U { v } = argmin u ∈U k u − v k .We also define the Q-function on state s under policy pair ( x, y ) as Q sx,y ( a, b ) = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ x,y i , which can be compactly written as a matrix Q sx,y ∈ R |A|×|B| such that V sx,y = x s ⊤ Q sx,y y s . We write Q s⋆ = Q sx ⋆ ,y ⋆ for any minimax/maximin policy pair ( x ⋆ , y ⋆ ) (which is unique even if ( x ⋆ , y ⋆ ) isnot). Finally, k Q k for a matrix Q is defined as max i,j | Q i,j | . Optimistic Gradient Descent Ascent (OGDA)
As mentioned, our algorithm is based on runningan instance of the OGDA algorithm on each state with an appropriate loss/reward function. Tothis end, here, following the exposition of (Wei et al., 2021) we briefly review OGDA for a matrixgame defined by a matrix Q ∈ R |A|×|B| . Specifically, OGDA maintains two sequences of actiondistributions b x , b x , . . . ∈ ∆ A and x , x , . . . ∈ ∆ A for Player 1, and similarly two sequences b y , b y , . . . ∈ ∆ B and y , y , . . . ∈ ∆ B for Player 2, following the updates below: b x t +1 = Π ∆ A (cid:8)b x t − ηQy t (cid:9) , x t +1 = Π ∆ A (cid:8)b x t +1 − ηQy t (cid:9) , b y t +1 = Π ∆ B (cid:8)b y t + ηQ ⊤ x t (cid:9) , y t +1 = Π ∆ B (cid:8)b y t +1 + ηQ ⊤ x t (cid:9) , (1)
2. The discount factor is usually some value close to , so we assume that it is no less than for simplicity.3. Note the slight abuse of notation here: the meaning of dist ⋆ ( · ) depends on its input. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES where η is some learning rate. As one can see, unlike the standard Gradient Descent Ascent al-gorithm which simply sets ( x t , y t ) = ( b x t , b y t ) , OGDA takes a further descent/ascent step usingthe latest gradient to obtain ( x t , y t ) , which is then used to evaluate the gradient (of the function f ( x, y ) = x ⊤ Qy ). Wei et al. (2021) prove that the iterate ( b x t , b y t ) (or ( x t , y t ) ) converges to the setof Nash equilibria of the matrix game at a linear rate, which motivates us to generalize it to Markovgames. As we show in the following sections, however, the extensions of both the algorithm and theanalysis are highly non-trivial.We remark that while Wei et al. (2021) also analyze the last-iterate convergence of another al-gorithm called Optimistic Multiplicative Weight Update (OMWU), which is even more commonlyused in finite-action games, they also show that the theoretical guarantees of OMWU hold undermore limited assumptions (e.g., requiring the uniqueness of the equilibrium), and its empirical per-formance is also inferior to that of OGDA. We therefore only extend the latter to Markov games.
3. Algorithm and Main Results
A natural idea to extend OGDA to Markov games is to run the same algorithm described in Section 2for each state s with the game matrix Q being Q sx t ,y t . However, an important difference is that nowthe game matrix is changing over time. Indeed, if the polices are changing rapidly for subsequentstates, the game matrix Q sx t ,y t will also be changing rapidly, which makes the update on state s highly unstable and in turn causes similar issues for previous states.To resolve this issue, we propose to have a critic slowly learn the value function for each state.Specifically, for each state s , the critic maintains a sequence of values V s = 0 , V s , V s , . . . . Duringiteration t , instead of using Q sx t ,y t as the game matrix for state s , we use Q st defined via Q st ( a, b ) = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) [ V s ′ t − ] . Ideally, OGDA would then take the role of an actor and compute x st +1 and b x st +1 using the gradient Q st y st (and similarly y st +1 and b y st +1 using the gradient Q s ⊤ t x st ).Since such exact gradient information is often unknown, we only require the algorithm to come upwith estimations ℓ st and r st such that k ℓ st − Q st y st k ≤ ε and k r st − Q s ⊤ t x st k ≤ ε for some prespecifiederror ε (more discussions in Section 3.1). See updates Eq. (2)-Eq. (5) in Algorithm 1. Note thatsimilar to (Wei et al., 2021), we adopt a constant learning rate η (independent of the number ofiterations) in these updates.At the end of each iteration t , the critic then updates the value function via V st = (1 − α t ) V st − + α t ρ st , where ρ st is an estimation of x s ⊤ t Q st y st such that | ρ st − x s ⊤ t Q st y st | ≤ ε . To stabilize the gamematrix, we require the learning rate α t to decrease in t and go to zero. Most of our analysis isconducted under this general condition, and the final convergence rate depends on the concreteform of α t , which we set to α t = H +1 H + t with H = − γ inspired by (Jin et al., 2018) (there could bea different choice leading to a better convergence though).Our main results are the following two theorems on the last-iterate convergence of Algorithm 1.
4. For simplicity, here we assume that the two players share the same estimator ρ st (and thus same V st and Q st ). However,our analysis works even if they maintain different versions of ρ st , as long as they are ε -close to x s ⊤ t Q st y st with respectto their own Q st . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Algorithm 1
Optimistic Gradient Descent/Ascent for Markov Games
Parameters : γ ∈ [ , , η ≤ q (1 − γ ) S , ε ∈ h , − γ i . Parameters : a non-increasing sequence { α t } Tt =1 that goes to zero. Initialization : ∀ s ∈ S , arbitrarily initialize b x s = x s ∈ ∆ A and b y s = y s ∈ ∆ B , and set V s ← . for t = 1 , . . . , T do For all s , define Q st ∈ R |A|×|B| as Q st ( a, b ) , σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t − i , and update b x st +1 = Π ∆ A nb x st − ηℓ st o , (2) x st +1 = Π ∆ A nb x st +1 − ηℓ st o , (3) b y st +1 = Π ∆ B nb y st + ηr st o , (4) y st +1 = Π ∆ B nb y st +1 + ηr st o , (5) V st = (1 − α t ) V st − + α t ρ st , (6)where ℓ st , r st , and ρ st are ε -approximations of Q st y st , Q s ⊤ t x st , and x s ⊤ t Q st y st respectively, such that k ℓ st − Q st y st k ≤ ε , k r st − Q s ⊤ t x st k ≤ ε , and | ρ st − x s ⊤ t Q st y st | ≤ ε . endTheorem 1 (Average duality-gap convergence) Algorithm 1 with the choice of α t = H +1 H + t where H = − γ guarantees T T X t =1 max s,x ′ ,y ′ (cid:16) V s b x t ,y ′ − V sx ′ , b y t (cid:17) = O |S| η (1 − γ ) r log TT + |S|√ ε √ η (1 − γ ) ! . Theorem 2 (Last-iterate convergence)
Algorithm 1 with the choice of α t = H +1 H + t where H = − γ guarantees with b z sT = ( b x sT , b y sT ) , |S| X s ∈S dist ⋆ ( b z sT ) = O (cid:18) |S| η C (1 − γ ) T + εηC (1 − γ ) (cid:19) , where C > is a problem-dependent constant (that always exists) satisfying: for all state s and allpolicy pair z = ( x, y ) , max x ′ ,y ′ ( x s Q s⋆ y ′ s − x ′ s Q s⋆ y s ) ≥ C dist ⋆ ( z s ) . Theorem 1 implies that max s,x ′ ,y ′ ( V s b x t ,y ′ − V sx ′ , b y t ) goes to zero (when both /T and ε go to zero),which in turn implies the convergence of b z t to the set of Nash equilibria, although without a concreterate. Theorem 2, on the other hand, shows a concrete finite-time convergence rate on the distance of b z sT from the equilibrium set, which goes down at the rate of /T up to the estimation error ε . Theproblem-dependent constant C is similar to the matrix game case analyzed in (Wei et al., 2021), as AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES we will discuss in Section 4. As far as we know, this is the first symmetric algorithm with finite-timelast-iterate convergence for both players simultaneously.
In the full-information setting where all parameters of the Markov game are given, we can calculatethe exact value of Q st y st , Q s ⊤ t x st , and x s ⊤ t Q st y st , making ε = 0 . In this case, our algorithm isessentially a new policy-iteration style algorithm for solving Markov games. However, in a learningsetting where the parameters are unknown, the players need to estimate these quantities based onany feedback from the environments. Here, we discuss how to do so when the players only observetheir current state and their loss/reward after taking an action.Specifically, in iteration t of our algorithm and with ( x t , y t ) at hand, the two players interact witheach other for a sequence of L steps, following a mixed strategy with a certain amount of uniformexploration defined via: e x st ( a ) = (cid:16) − ε ′ (cid:17) x st ( a ) + ε ′ |A| and e y st ( b ) = (cid:16) − ε ′ (cid:17) y st ( b ) + ε ′ |B| , where ε ′ = (1 − γ ) ε . This generates a sequence of observations { ( s i , a i , σ ( s i , a i , b i )) } Li =1 for Player 1and similarly a sequence of observations { ( s i , b i , σ ( s i , a i , b i )) } Li =1 for Player 2, where a i ∼ e x s i t , b i ∼ e y s i t , and s i +1 ∼ p ( ·| s i , a i , b i ) . Then we construct the estimators as follows: ℓ st ( a ) = P Li =1 [ s i = s, a i = a ] (cid:0) σ ( s, a, b i ) + γV s i +1 t − (cid:1)P Li =1 [ s i = s, a i = a ] , (7) r st ( b ) = P Li =1 [ s i = s, b i = b ] (cid:0) σ ( s, a i , b ) + γV s i +1 t − (cid:1)P Li =1 [ s i = s, b i = b ] , (8) ρ st = P Li =1 [ s i = s ] (cid:0) σ ( s, a i , b i ) + γV s i +1 t − (cid:1)P Li =1 [ s i = s ] . (9)(If any of the denominator is zero, define the corresponding estimator as zero.) To make surethat these are accurate estimators for every state, we naturally need to ensure that every state isvisited often enough. To this end, we make the following assumption similar to (Auer and Ortner,2007), which essentially requires that the induced Markov chain under any stationary policy pair isirreducible. Assumption 1
There exists µ > such that µ = max x,y max s,s ′ T s → s ′ x,y , where T s → s ′ x,y is theexpected time to reach s ′ from s following the policy pair ( x, y ) . Under this assumption, the following theorem shows that taking L ≈ /ε is enough to ensurethe accuracy of the estimators (see Appendix H for the proof). Theorem 3
If Assumption 1 holds and L = e Ω (cid:16) |A| + |B| (1 − γ ) µε log ( T /δ ) (cid:17) , then the estimators Eq. (7) ,Eq. (8) , and Eq. (9) ensure that with probability at least − δ , k ℓ st − Q st y st k , k r st − Q s ⊤ t x st k , and | ρ st − x s ⊤ t Q st y st | are all of order O ( ε ) for all t . Together with Theorem 1 and Theorem 2, given a fixed number of interactions between theplayers, we can now determine optimally how many iterations we should run our algorithm (and
5. We use e Ω to hide logarithmic factors except for T and /δ . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES consequently how large we should set ε ). Equivalently, we show below how many iterations or totalinteractions are need to achieve a certain accuracy. (The choice of α t is the same as in Theorem 1and Theorem 2.) Corollary 4
If Assumption 1 holds, then running Algorithm 1 with estimators Eq. (7) , Eq. (8) ,Eq. (9) and L = e Ω (cid:16) ( |A| + |B| ) |S| (1 − γ ) µη ξ log ( T /δ ) (cid:17) for T = e Ω (cid:16) |S| η (1 − γ ) ξ (cid:17) iterations ensures withprobability at least − δ , T P Tt =1 max s,x ′ ,y ′ ( V s b x t ,y ′ − V sx ′ , b y t ) ≤ ξ . Ignoring other dependence, thisrequires e Ω(1 /ξ ) interactions in total. Corollary 5
If Assumption 1 holds, then running Algorithm 1 with estimators Eq. (7) , Eq. (8) ,Eq. (9) and L = e Ω (cid:16) |A| + |B| (1 − γ ) µη C ξ log ( T /δ ) (cid:17) for T = Ω (cid:16) |S| η C (1 − γ ) ξ (cid:17) iterations ensures withprobability at least − δ , |S| P s ∈S dist ⋆ ( b z sT ) ≤ ξ . Ignoring other dependence, this requires e Ω(1 /ξ ) interactions in total. Finally, we argue that from the perspective of a single player (take Player 1 as an example), ouralgorithm is also rational, in the sense that it allows Player 1 to converge to the best response to heropponent if Player 2 is not applying our algorithm but instead uses an arbitrary stationary policy. We show this single-player-perspective version in Algorithm 2, where Player 1 still follows theupdates Eq. (2), Eq. (3), and Eq. (6), while y t is fixed to a stationary policy y used by Player 2.In fact, thanks to the agnostic nature of our algorithm, rationality is essentially an implicationof the convergence property. To see this, consider a modified two-player Markov game with thedifference being that the opponent has only a single action (call it ) on each state, the loss functionis redefined as σ ( s, a,
1) = E b ∼ y s [ σ ( s, a, b )] , and the transition kernel is redefined as p ( s ′ | s, a,
1) = E b ∼ y s [ p ( s ′ | s, a, b )] . It is straightforward to see that following our algorithm, Player 1’s behaviors inthe original game and in the modified game are exactly the same. On the other hand, in the modifiedgame, since Player 2 has only one action (and thus one strategy), she can also be seen as using ouralgorithm. Therefore, we can apply our convergent guarantees to the modified game, and since theminimax policy in the modified game is exactly the best response in the original game, we knowthat Player 1 indeed converges to the best response. We summarize these rationality guarantees inthe following theorem, with the formal proof deferred to Appendix I. Theorem 6
Algorithm 2 with the choice of α t = H +1 H + t where H = − γ guarantees T T X t =1 max s,x ′ (cid:16) V s b x t ,y s − V sx ′ ,y s (cid:17) = O |S| η (1 − γ ) r log TT + |S|√ ε √ η (1 − γ ) ! , and for X BR = n x : V sx,y = min x ′ V sx ′ ,y , ∀ s ∈ S o and some problem-dependent constant C ′ > , |S| X s ∈S k b x sT − Π X BR { b x sT }k = O (cid:18) |S| η C ′ (1 − γ ) T + εηC ′ (1 − γ ) (cid:19) .
6. The rationality defined by Bowling and Veloso (2001) requires that the learner converges to the best response as longas the opponent converges to a stationary policy. While our algorithm does handle this case, as a proof of concept,we only consider the simpler scenario where the opponent simply uses a stationary policy. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
4. Analysis Overview
In this section, we give an overview of how we analyze Algorithm 1 and prove Theorem 1 andTheorem 2. We start by giving a quick review of the analysis of (Wei et al., 2021) for matrix games,and then highlight how we overcome the challenges when generalizing it to Markov games.
Review for matrix games
Recall the update in Eq. (1) for a fixed matrix Q . Wei et al. (2021)show the following two convergence guarantees:1. Average duality-gap convergence: T T X t =1 ∆( b z t ) = O (cid:18) η √ T (cid:19) (10)where ∆( z ) = max x ′ ,y ′ (cid:0) x ⊤ Qy ′ − x ′⊤ Qy (cid:1) is the duality gap of z = ( x, y ) .2. Last-iterate convergence: dist ⋆ ( b z t ) ≤ C dist ⋆ ( b z ) (cid:0) η C (cid:1) − t (11)where dist ⋆ ( z ) is the distance from z to the set of equilibria, C is a universal constant, and C > is a positive constant that depends on Q . The analysis of (Wei et al., 2021) starts from the following single-step inequality that follows thestandard Online Mirror Descent analysis and describes the relation between dist ⋆ ( b z t +1 ) and dist ⋆ ( b z t ) : dist ⋆ ( b z t +1 ) ≤ dist ⋆ ( b z t ) + η k z t − z t − k | {z } instability penalty − (cid:16) k b z t +1 − z t k + k z t − b z t k (cid:17)| {z } instability bonus . (12)The instability penalty term makes dist ⋆ ( b z t +1 ) larger if k z t − z t − k is large, while the instabilitybonus term makes dist ⋆ ( b z t +1 ) smaller if either k b z t +1 − z t k or k z t − b z t k is large. To obtain Eq. (10),Wei et al. (2021) make the observation that the instability bonus term is lower bounded by a constanttimes the squared duality gap of b z t +1 , that is, k b z t +1 − z t k + k z t − b z t k & η ∆ ( b z t +1 ) , and thus dist ⋆ ( b z t +1 ) ≤ dist ⋆ ( b z t ) + η k z t − z t − k | {z } instability penalty − (cid:16) k b z t +1 − z t k + k z t − b z t k (cid:17)| {z } instability bonus − Ω( η ∆ ( b z t +1 )) . (13)By taking η ≤ , summing over t , canceling the penalty term with the bonus term, telescoping andrearranging, we get P Tt =1 ∆ ( b z t ) ≤ O (1 /η ) . An application of Cauchy-Schwarz inequality thenproves Eq. (10).To further obtain Eq. (11), Wei et al. (2021) prove that there exists some problem-dependentconstant C > such that for all z , ∆( z ) ≥ C dist ⋆ ( z ) . This, when combined with Eq. (13), shows dist ⋆ ( b z t +1 ) ≤ dist ⋆ ( b z t )1 + Ω( η C ) + η k z t − z t − k − Ω (cid:16) k b z t +1 − z t k + k z t − b z t k (cid:17) . (14)
7. This is not to be confused with the constant C in Theorem 2. We overload the notation because they indeed play thesame role in the analysis. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
By upper bounding k z t − z t − k ≤ k z t − b z t k + 2 k b z t − z t − k and rearranging, they further obtain: dist ⋆ ( b z t +1 ) + c ′ k b z t +1 − z t k + c ′ k z t − b z t k ≤ dist ⋆ ( b z t ) + c ′ k b z t − z t − k + c ′ k z t − − b z t − k η C ) (15)for some universal constant c ′ , which clearly indicates the linear convergence of dist ⋆ ( b z t ) and henceproves Eq. (11). Overview of our proofs
We are now ready to show the high-level ideas of our analysis. Forsimplicity, we consider the case with ε = 0 and also assume that there is a unique equilibrium ( x ⋆ , y ⋆ ) (these assumptions are removed in the formal proofs). Our analysis follows the steps below. Step 1 (Appendix B)
Similar to Eq. (12), we conduct a single-step analysis for OGDA in Markovgames (Lemma 24), which shows for all state s : dist ⋆ ( b z st +1 ) ≤ dist ⋆ ( b z st ) + η (cid:13)(cid:13) z st − z st − (cid:13)(cid:13) − (cid:16)(cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k (cid:17) + 8 η (cid:13)(cid:13) Q st − Q st +1 (cid:13)(cid:13) + 4 η k Q st − Q s⋆ k . (16)Comparing this with Eq. (12), we see that, importantly, since the game matrix Q st is changing overtime, we have two extra instability penalty terms: η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) and η k Q st − Q s⋆ k . Our hope isto further upper bound these two penalty terms by something related to k z st − z st +1 k , so that theycan again be canceled by the bonus term − ( (cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k ) . Indeed, in Steps 3-5, weshow that part of them can be bounded by a weighted sum of {k z s ′ τ − z s ′ τ +1 k } s ′ ∈S ,τ ≤ t . Step 2 (Appendix C): Lower bounding (cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k . As in Eq. (13), we aim tolower bound the instability bonus term by the duality gap. However, since the updates are based on Q st instead of Q s⋆ , we can only relate the bonus term to the duality gap with respect to Q st . To furtherrelate this to the duality gap with respect to Q s⋆ , we pay a quantity related to k Q st − Q s⋆ k . Formally,we show in Lemma 25: (cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k & Ω( η ∆ ( b z st +1 )) − O ( η k Q st − Q s⋆ k ) , where ∆( z s ) , max x ′ s ,y ′ s ( x s Q s⋆ y ′ s − x ′ s Q s⋆ y s ) is the duality gap on state s with respect to Q s⋆ . Step 3 (Appendix D): Upper bounding (cid:13)(cid:13) Q st +1 − Q st (cid:13)(cid:13) . (cid:13)(cid:13) Q st +1 − Q st (cid:13)(cid:13) is upper bounded by γ max s ′ ( V s ′ t − V s ′ t − ) by the definition of Q st . Furthermore, V s ′ t − V s ′ t − is a weighted sum of { ρ s ′ τ − ρ s ′ τ − } t − τ =1 by the definition of V s ′ t , and also ρ s ′ τ − ρ s ′ τ − = x s ′ τ Q s ′ τ y s ′ τ − x s ′ τ − Q s ′ τ − y s ′ τ − = O ( k z s ′ τ − z s ′ τ − k + k Q s ′ τ − Q s ′ τ − k ) . In sum, one can upper bound k Q st +1 − Q st k by a weighted sumof k z s ′ τ − z s ′ τ − k and k Q s ′ τ − Q s ′ τ − k . After formalizing the above relations, we obtain the followinginequality (see Lemma 28): (cid:13)(cid:13) Q st +1 − Q st (cid:13)(cid:13) ≤ max s ′ γ (1 − γ ) t X τ =1 α τt k z s ′ τ − z s ′ τ − k + max s ′ γ γ t X τ =1 α τt k Q s ′ τ − Q s ′ τ − k (17)for some coefficient α τt defined in Appendix A.2. With recursive expansion, the above implies that (cid:13)(cid:13) Q st +1 − Q st (cid:13)(cid:13) can be upper bounded by a weighted sum of k z s ′ τ − z s ′ τ − k for s ′ ∈ S and τ ≤ t .
8. Similar to the notation dist ⋆ ( · ) , we also omit writing the s dependence for the function ∆( · ) . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Step 4 (Appendix E): Upper bounding k Q st − Q s⋆ k (Part 1). We first upper bound k Q st − Q s⋆ k with respect to the following weighted-regret quantityReg t , max s max ( t X τ =1 α τt ( x sτ − x s⋆ ) Q sτ y sτ , t X τ =1 α τt x sτ Q sτ ( y s⋆ − y sτ ) ) . To do so, we define Γ t = max s k Q st − Q s⋆ k and show for the same coefficient α τt mentioned earlier, V st = t X τ =1 α τt ρ sτ = t X τ =1 α τt x sτ Q sτ y sτ ≤ t X τ =1 α τt x s⋆ Q sτ y sτ + Reg t ≤ t X τ =1 α τt x s⋆ Q s⋆ y sτ + t X τ =1 α τt Γ τ + Reg t ≤ t X τ =1 α τt x s⋆ Q s⋆ y s⋆ + t X τ =1 α τt Γ τ + Reg t = V s⋆ + t X τ =1 α τt Γ τ + Reg t where the last inequality is by the fact P tτ =1 α τt = 1 . Using the definition of Q st again, we thenhave Q st +1 ( a, b ) − Q s⋆ ( a, b ) = γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t − V s ′ ⋆ i ≤ γ ( P tτ =1 α τt Γ τ + Reg t ) . By the samereasoning, we can also show Q st +1 ( a, b ) − Q s⋆ ( a, b ) ≥ − γ ( P tτ =1 α τt Γ τ + Reg t ) , and therefore weobtain the following recursive relation (Lemma 29) Γ t +1 = max s k Q st +1 − Q s⋆ k ≤ γ t X τ =1 α τt Γ τ + Reg t ! . (18) Step 5 (Appendix E): Upper bounding k Q st − Q s⋆ k (Part 2). In this step, we further relate Reg t to {k z s ′ τ − z s ′ τ − k } τ ≤ t,s ′ ∈S . From a one-step regret analysis of OGDA, we have the following (forPlayer 1): ( x st − x s⋆ ) Q st y st ≤ η (cid:16) dist ⋆ ( b x st ) − dist ⋆ ( b x st +1 ) (cid:17) + 4 η (1 − γ ) k y st − y st − k + 4 η k Q st − Q st − k . Recall that Reg t is defined via a weighted sum of the left-hand side above with weights α τt . There-fore, we take the weighted sum of the above and bound P tτ =1 α τt ( x sτ − x s⋆ ) Q sτ y sτ by α t dist ⋆ ( b x s )2 η + t X τ =1 α τt η (cid:0) dist ⋆ ( b x sτ ) − dist ⋆ ( b x sτ +1 ) (cid:1) + 4 η (1 − γ ) t X τ =1 α τt k y sτ − y sτ − k + 4 η t X τ =1 α τt k Q sτ − Q sτ − k ≤ η t X τ =1 α τt α τ − dist ⋆ ( b z sτ ) | {z } term + 4 η (1 − γ ) t X τ =1 α τt k z sτ − z sτ − k | {z } term + 4 η t X τ =1 α τt k Q sτ − Q sτ − k | {z } term (19)where in the inequality we rearrange the first summation and use the fact α τt − α τ − t ≤ α τ − α τt (see the formal proof in Lemma 30). Since the case for P tτ =1 α τt x sτ Q sτ ( y s⋆ − y sτ ) is similar, by thedefinition of Reg t , we conclude that Reg t is upper bounded by the maximum over s of the sum ofthe three terms in Eq. (19). Note that, term is itself a weighted sum of {k z sτ − z sτ − k } τ ≤ t , and term can also be upper bounded by a weighted sum of {k z s ′ τ − z s ′ τ − k } τ ≤ t,s ′ ∈S as we alreadyshowed in Step 3. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Combining all steps.
Summing up Eq. (16) over all s , and based on all earlier discussions, wehave X s dist ⋆ ( b z st +1 ) ≤ X s dist ⋆ ( b z st ) + t X τ =1 X s µ sτ α τ − dist ⋆ ( b z sτ ) | {z } term + t X τ =1 X s ν sτ k z sτ − z sτ − k | {z } term − X s (cid:16)(cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k (cid:17)| {z } term − Ω η X s ∆ ( b z st +1 ) ! (20)for some weights µ sτ and ν sτ (a large part of the analysis is devoted to precisely calculating theseweights). Here, the − Ω (cid:0) η P s ∆ ( b z st +1 ) (cid:1) term comes from Step 2; term is a weighted sum of { α τ − dist ⋆ ( b z s ′ τ ) } τ ≤ t,s ′ ∈S that comes from term in Step 5; term is a weighed sum of {k z s ′ τ − z s ′ τ − k } τ ≤ t,s ′ ∈S that comes from all other terms we discuss in Steps 3-5. Obtaining average duality-gap bound
To obtain the average duality-gap bound in Theorem 1,we sum Eq. (20) over t , and further argue that the sum of term over t is smaller than the sum of term over t (hence they are canceled with each other). Rearranging and telescoping leads to η T X t =1 X s ∆ ( b z st +1 ) = O T X t =1 t X τ =1 X s µ sτ α τ − dist ⋆ ( b z sτ ) ! = O T X t =1 t X τ =1 X s µ sτ α τ − ! . As long as α t is decreasing and going to zero, the right-hand side above can be shown to be sub-linear in T . Further relating max x ′ ,y ′ (cid:16) V s b x t ,y ′ − V sx ′ , b y t (cid:17) to ∆( b z st ) (Lemma 32) proves Theorem 1. Obtaining last-iterate convergence bound
Following the matrix game case, there is a problem-dependent constant
C > such that ∆( b z st +1 ) ≥ C dist ⋆ ( b z st +1 ) . Similarly to how Eq. (14) is ob-tained, we use this in Eq. (20) and arrive at X s dist ⋆ ( b z st +1 ) ≤
11 + Ω( η C ) X s dist ⋆ ( b z st ) + t X τ =1 X s µ sτ α τ − dist ⋆ ( b z sτ ) | {z } term + t X τ =1 X s ν sτ k z sτ − z sτ − k | {z } term − Ω X s (cid:16)(cid:13)(cid:13)b z st +1 − z st (cid:13)(cid:13) + k z st − b z st k (cid:17)| {z } term ! (21)Then ideally we would like to follow a similar argument from Eq. (14) to Eq. (15) to obtain a last-iterate convergence guarantee. However, we face two more challenges here. First, we have an extra term . Fortunately, this term vanishes when t is large as long as α t decreases and converges tozero. Second, in Eq. (14), the indices of the negative term k b z t +1 − z t k + k z t − b z t k and the positiveterm η k z t − z t − k are only offset by so that a simple rearrangement is enough to get Eq. (15),while in Eq. (21), the indices in term and term are far from each other. To address this issue, wefurther introduce a set of weights and consider a weighted sum of Eq. (21) over t . We then showthat the weighted sum of term can be canceled by the weighted sum of term . Combining theabove proves Theorem 2. Note that due to these extra terms, our last-iterate convergence rate isonly sublinear (while Eq. (11) shows a linear rate for matrix games). AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
5. Conclusion and Future Directions
In this work, we propose the first decentralized algorithm for two-player zero-sum Markov gamesthat is rational, convergent, agnostic, symmetric, and having a finite-time convergence rate guaranteeat the same time. The algorithm is based on running OGDA on each state, together with a slowlychanging critic that stabilizes the game matrix on each state.Our work studies the most basic tabular setting, and also requires a structural assumption whenestimation is needed that sidesteps the difficulty of performing exploration over the state space.Important future directions include relaxing either of these assumptions, that is, extending ourframework to allow function approximation and/or incorporating efficient exploration mechanisms.Studying OGDA-based algorithms beyond the two-player zero-sum setting is also an interestingfuture direction.
References
Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gell´ertWeisz. Politex: Regret bounds for policy iteration using expert prediction. In
InternationalConference on Machine Learning , 2019.Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradi-ent methods: Optimality, approximation, and distribution shift. arXiv preprint arXiv:1908.00261 ,2019.Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcementlearning. In B. Sch¨olkopf, J. Platt, and T. Hoffman, editors,
Advances in Neural InformationProcessing Systems , 2007.Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. arXivpreprint arXiv:2002.04017 , 2020.Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play.
Advancesin Neural Information Processing Systems , 2020.James P Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In
Proceedings of the 2018 ACM Conference on Economics and Computation , 2018.Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In
Proceedings of the 17th international joint conference on Artificial intelligence , 2001.Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning.
Journal of Machine Learning Research , 2002.Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, andShenghuo Zhu. Online optimization with gradual variations. In
Conference on Learning Theory ,2012.Vincent Conitzer and Tuomas Sandholm. Awesome: A general multiagent learning algorithm thatconverges in self-play and learns a best response against stationary opponents.
Machine Learning ,2007. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Constantinos Daskalakis, Dylan J. Foster, and Noah Golowich. Independent policy gradient meth-ods for competitive reinforcement learning. In
Advances in neural information processing sys-tems , 2020.Jerzy Filar and Koos Vrieze.
Competitive Markov decision processes . Springer Science & BusinessMedia, 2012.Jerzy A Filar and Boleslaw Tolwinski. On the algorithm of pollatschek and avi-ltzhak. 1991.Andrew Gilpin, Javier Pena, and Tuomas Sandholm. First-order algorithm with O (ln(1 /ǫ )) conver-gence for ǫ -equilibrium in two-person zero-sum games. Mathematical programming , 2012.Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence ratesfor no-regret learning in multi-player games.
Advances in neural information processing systems ,2020.Alan J Hoffman and Richard M Karp. On nonterminating stochastic games.
Management Science ,1966.Yu-Guan Hsieh, Franck Iutzeler, J´erˆome Malick, and Panayotis Mertikopoulos. On the convergenceof single-call stochastic extra-gradient methods. In
Advances in Neural Information ProcessingSystems , 2019.Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably effi-cient? In
Advances in Neural Information Processing Systems , 2018.Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local conver-gence of generative adversarial networks. In
The 22nd International Conference on ArtificialIntelligence and Statistics , 2019.Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In
Machine learning proceedings . 1994.Boyi Liu, Zhuoran Yang, and Zhaoran Wang. Policy optimization in zero-summarkov games: Fictitious self-play provably attains nash equilibria, 2020a. URL https://openreview.net/forum?id=c3MWGN_cTf .Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcementlearning with self-play. arXiv preprint arXiv:2010.01604 , 2020b.Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient andoptimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprintarXiv:1901.08511 , 2019.Julien Perolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programmingfor two-player zero-sum markov games. In
International Conference on Machine Learning , 2015.Julien P´erolat, Bilal Piot, Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. Softened approx-imate policy iteration for markov games. In
International Conference on Machine Learning ,2016. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Julien Perolat, Bilal Piot, and Olivier Pietquin. Actor-critic fictitious play in simultaneous movemultistage games. In
International Conference on Artificial Intelligence and Statistics , 2018.MA Pollatschek and B Avi-Itzhak. Algorithms for stochastic games with geometrical interpretation.
Management Science , 1969.Leonid Denisovich Popov. A modification of the arrow-hurwicz method for search of saddle points.
Mathematical notes of the Academy of Sciences of the USSR , 1980.Goran Radanovic, Rati Devidze, David Parkes, and Adish Singla. Learning to collaborate in markovdecision processes. In
International Conference on Machine Learning , 2019.Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable se-quences. In
Advances in Neural Information Processing Systems , 2013.Muhammed O Sayin, Francesca Parise, and Asuman Ozdaglar. Fictitious play in zero-sum stochas-tic games. arXiv preprint arXiv:2010.04223 , 2020.Lloyd S Shapley. Stochastic games.
Proceedings of the national academy of sciences , 1953.Aaron Sidford, Mengdi Wang, Lin Yang, and Yinyu Ye. Solving discounted stochastic two-playergames with near-optimal time and sample complexity. In
International Conference on ArtificialIntelligence and Statistics , 2020.David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of gowithout human knowledge.
Nature , 2017.Csaba Szepesv´ari and Michael L Littman. A unified analysis of value-function-based reinforcement-learning algorithms.
Neural computation , 1999.Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Provably efficient online agnostic learningin markov games. arXiv preprint arXiv:2010.15020 , 2020.J Van Der Wal. Discounted markov games: Generalized policy iteration method.
Journal of Opti-mization Theory and Applications , 1978.Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Micha¨el Mathieu, Andrew Dudzik, Juny-oung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmasterlevel in starcraft ii using multi-agent reinforcement learning.
Nature , 2019.Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In
Advances in Neural Information Processing Systems , 2017.Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergencein constrained saddle-point optimization.
International Conference on Learning Representations ,2021.Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. arXiv preprintarXiv:2002.07066 , 2020. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Kaiqing Zhang, Zhuoran Yang, and Tamer Bas¸ar. Multi-agent reinforcement learning: A selectiveoverview of theories and algorithms. arXiv preprint arXiv:1911.10635 , 2019a.Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Policy optimization provably converges to nashequilibria in zero-sum linear quadratic games. In
Advances in Neural Information ProcessingSystems , 2019b.Kaiqing Zhang, Sham M Kakade, Tamer Bas¸ar, and Lin F Yang. Model-based multi-agent rl inzero-sum markov games with near-optimal sample complexity. arXiv preprint arXiv:2007.07461 ,2020.
Appendix A. Notations
A.1. Simplifications of the Notations
We define the following notations to simplify the proofs:
Definition 7 b x s = x s = |A| (zero vector with dimension |A| ), b y s = y s = |B| , Q s = |A|×|B| , ℓ s = |A| , r s = |B| , ρ s = 0 , α = 1 . Besides, for a matrix Q , we define k Q k = max i,j | Q ij | . To avoid cluttered notation, a product ofthe form x ⊤ Qy is usually simply written as xQy . A.2. Auxiliary Coefficients
In this subsection, we define several coefficients that are related to the value learning rate { α t } . Definition 8 ( α τt ) For non-negative integers τ and t with τ ≤ t , define α τt = α τ Q ti = τ +1 (1 − α i ) . Definition 9 ( δ τt ) For non-negative integers τ and t with τ ≤ t , define δ τt , Q ti = τ +1 (1 − α i ) . Definition 10 ( β τt ) For positive integers τ and t with τ < t , define β τt = α τ Q t − i = τ (1 − α i + α i γ ) .Define β tt = 1 . Definition 11 ( λ t ) For positive integers t , define λ t = max n α t +1 α t , − α t (1 − γ )2 o . Definition 12 ( λ τt ) For positive integers τ and t with τ < t , define λ τt = α τ Q t − i = τ λ i . Define λ tt = 1 . A.3. Auxiliary Variables
In this subsection, we define several auxiliary variables to be used in the later analysis.
Definition 13 ( J st ) For every state s ∈ S , define the sequence { J st } t =1 , ,... by J s = k z s − z s k ,J st = (1 − α t ) J st − + α t (cid:13)(cid:13) z st − z st − (cid:13)(cid:13) , ∀ t ≥ . Furthermore, define J t , max s J st . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Definition 14 ( K st ) For every state s ∈ S , define the sequence { K st } t =1 , ,... by K s = k Q s − Q s k ,K st = (1 − α t ) K st − + α t (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) , ∀ t ≥ . Furthermore, define K t , max s K st . Definition 15 ( b x st⋆ , b y st⋆ , b z st⋆ ) Define b x st⋆ = Π X s⋆ ( b x st ) , i.e., the projection of b x st onto the set of optimalpolicy X s⋆ on state s . Similarly, b y st⋆ = Π Y s⋆ ( b y st ) , and b z st⋆ = Π Z s⋆ ( b z st ) = ( b x st⋆ , b y st⋆ ) . Definition 16 ( ∆ st ) Define ∆ st = max x ′ ,y ′ ( b x st Q s⋆ y ′ s − x ′ s Q s⋆ b y st ) for all t ≥ . Definition 17 (Reg st ) Define
Reg st = max ( t X τ =1 α τt ( x sτ − b x st⋆ ) Q sτ y sτ , t X τ =1 α τt x sτ Q sτ ( b y st⋆ − y sτ ) ) and Reg t = max s Reg st . Definition 18 ( Γ t ) Define Γ t = max s k Q st − Q s⋆ k . Definition 19 ( θ st ) Define θ st = k b z st − z st − k + k z st − − b z st − k Definition 20 ( Z t ) Define Z t = max s P tτ =1 α τt α τ − dist ⋆ ( b z sτ ) . A.4. Assumptions on α t and Simple Facts about α τt We require α t to satisfy the following:• α = 1 • < α t +1 ≤ α t ≤ • α t → as t → ∞ Furthermore, α , . Below is an useful lemma that is used in many places: Lemma 21 If { h t } t =0 , , ,... and { k t } t =1 , ,... are non-negative sequences that satisfy h t = (1 − α t ) h t − + α t k t for t ≥ , then h t = P tτ =1 α τt k τ . Proof
We prove it by induction. When t = 1 , since α = 1 , h = k = α k . Assume that theformula is correct for h t . Then h t +1 = (1 − α t +1 ) h t + α t +1 k t +1 = (1 − α t +1 ) t X τ =1 α τt k τ + α t +1 t +1 k t +1 = t X τ =1 α τt +1 k τ + α t +1 t +1 k t +1 = t +1 X τ =1 α τt +1 k τ . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Corollary 22
The following hold: • V st = P tτ =1 α τt ρ sτ • J st = P tτ =1 α τt (cid:13)(cid:13) z sτ − z sτ − (cid:13)(cid:13) • K st = P tτ =1 α τt (cid:13)(cid:13) Q sτ − Q sτ − (cid:13)(cid:13) Proof
They immediately follow from Lemma 21 and the definition of J st , K st , V st . Appendix B. Proof for Step 1: Single-Step Inequality
Lemma 23
For any state s and t , ( x st − b x st⋆ ) Q st y st ≤ η (cid:16) dist ⋆ ( b x st ) − dist ⋆ ( b x st +1 ) − k b x st +1 − x st k − k x st − b x st k (cid:17) + 4 η (1 − γ ) k y st − y st − k + 4 η k Q st − Q st − k + 3 ε,x st Q st ( b y st⋆ − y st ) ≤ η (cid:16) dist ⋆ ( b y st ) − dist ⋆ ( b y st +1 ) − k b y st +1 − y st k − k y st − b y st k (cid:17) + 4 η (1 − γ ) k x st − x st − k + 4 η k Q st − Q st − k + 3 ε. Proof
By standard proof of OGDA (see, e.g., the proof of Lemma 1 in (Wei et al., 2021) or Lemma1 in (Rakhlin and Sridharan, 2013)), we have ( x st − b x st⋆ ) ⊤ ℓ st ≤ η (cid:16) k b x st − b x st⋆ k − (cid:13)(cid:13)b x st +1 − b x st⋆ (cid:13)(cid:13) − k b x st +1 − x st k − k x st − b x st k (cid:17) + η k ℓ st − ℓ st − k . Since k b x st − b x st⋆ k = dist ⋆ ( b x st ) and (cid:13)(cid:13)b x st +1 − b x st⋆ (cid:13)(cid:13) ≥ dist ⋆ ( b x st +1 ) by the definition of dist ⋆ ( · ) , wefurther have ( x st − b x st⋆ ) ⊤ ℓ st ≤ η (cid:16) dist ⋆ ( b x st ) − dist ⋆ ( b x st +1 ) − k b x st +1 − x st k − k x st − b x st k (cid:17) + η k ℓ st − ℓ st − k . (22)By the definition of ℓ st , we have η (cid:13)(cid:13) ℓ st − ℓ st − (cid:13)(cid:13) ≤ η (cid:13)(cid:13) ℓ st − Q st y st + ( Q st − Q st − ) y st + Q st − ( y st − y st − ) + Q st − y st − − ℓ st − (cid:13)(cid:13) ≤ η k ℓ st − Q st y st k + 4 η (cid:13)(cid:13) ( Q st − Q st − ) y st (cid:13)(cid:13) + 4 η (cid:13)(cid:13) Q st − ( y st − y st − ) (cid:13)(cid:13) + 4 η (cid:13)(cid:13) Q st − y st − − ℓ st − (cid:13)(cid:13) ≤ η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) + 4 η (1 − γ ) (cid:13)(cid:13) y st − y st − (cid:13)(cid:13) + 8 ηε and ( x st − b x st⋆ ) Q st y st ≤ ( x st − b x st⋆ ) ℓ st + 2 ε. Combining them with Eq. (22) and the fact that ηε ≤ η − γ ≤ , we get the first inequality that wewant to prove. The other inequality is similar. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Lemma 24
For all t ≥ , dist ⋆ ( b z st +1 ) ≤ dist ⋆ ( b z st ) − θ st +1 + θ st + 4 η Γ t + 8 η k Q st − Q st − k + 6 ηε. Proof
Summing up the two inequalities in Lemma 23, we get η ( x st − b x st⋆ ) Q st y st + 2 ηx st Q st ( b y st⋆ − y st ) ≤ dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) + 4 η (1 − γ ) k z st − z st − k + 8 η k Q st − Q st − k − k b z st +1 − z st k − k z st − b z st k + 6 ηε ≤ dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) + 132 k z st − z st − k + 8 η k Q st − Q st − k − k b z st +1 − z st k − k z st − b z st k + 6 ηε ≤ dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) + 116 (cid:0) k z st − b z st k + k b z st − z st − k (cid:1) + 8 η k Q st − Q st − k − k b z st +1 − z st k − k z st − b z st k + 6 ηε = dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) + 8 η k Q st − Q st − k − k z st − b z st k − k b z st +1 − z st k + 116 k b z st − z st − k + 6 ηε The left-hand side above, can be lower bounded by η ( x st − b x st⋆ ) Q st y st + 2 ηx st Q st ( b y st⋆ − y st ) = 2 ηx st Q st b y st⋆ − η b x st⋆ Q st y st ≥ ηx st Q s⋆ b y st⋆ − η b x st⋆ Q s⋆ y st − η Γ t ≥ − η Γ t . (by the optimality of b x st⋆ and b y st⋆ )Combining the inequalities and using the definition of θ st finish the proof. Appendix C. Proof for Step 2: Lower Bounding k b z st +1 − z st k + k z st − b z st k Lemma 25
For all t ≥ , we have θ st +1 + η Γ t + 2 η ǫ ≥ η (cid:0) ∆ st +1 (cid:1) . Proof
By Eq. (2) and the optimality condition for b x st +1 , we have ( b x st +1 − b x st + ηℓ st ) · ( x ′ s − b x st +1 ) ≥ (23)for any x ′ s ∈ ∆ A . Then by the definition of ℓ st , ( b x st +1 − b x st + ηQ st y st ) · ( x ′ s − b x st +1 ) ≥ ( b x st +1 − b x st + ηℓ st ) · ( x ′ s − b x st +1 ) − ηε ≥ − ηε (24)where in the last inequality we use Eq. (23). Thus we have for any x ′ s ∈ ∆ A , √ k b x st +1 − x st k + k x st − b x st k ) ≥ √ k b x st +1 − b x st k≥ k b x st +1 − b x st k ≥ ( b x st +1 − b x st ) · ( x ′ s − b x st +1 ) ≥ η ( b x st +1 − x ′ s ) Q st y st − ηε (by Eq. (24)) = η ( x st − x ′ s ) Q st y st + η ( b x st +1 − x st ) Q st y st − ηε ≥ η ( x st − x ′ s ) Q st y st − η k b x st +1 − x st k − γ − ηε. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Using the fact that η − γ ≤ , we get k b x st +1 − x st k + k x st − b x st k + √ ηε ≥ √ (cid:18) η max x ′ ( x st − x ′ s ) Q st y st (cid:19) ≥ η x ′ ( x st − x ′ s ) Q st y st . Similarly, we have k b y st +1 − y st k + k y st − b y st k + √ ηε ≥ η max y ′ x st Q st ( y ′ s − y st ) . Combining themand using k z − z ′ k ≥ k x − x ′ k + k y − y ′ k , we get k b z st +1 − z st k + k z st − b z st k + √ ηε ≥ η (cid:18) max x ′ ( x st − x ′ s ) Q st y st + max y ′ x st Q st ( y ′ s − y st ) (cid:19) ≥ η (cid:16) max y ′ x st Q st y ′ s − min x ′ x ′ s Q st y st (cid:17) = η y ′ (cid:16)b x st +1 Q s⋆ y ′ s + x st ( Q st − Q s⋆ ) y ′ s + ( x st − b x st +1 ) Q s⋆ y ′ s (cid:17) − η x ′ (cid:16) x ′ s Q s⋆ b y st +1 + x ′ s ( Q st − Q s⋆ ) y st + x ′ Q s⋆ ( y st − b y st +1 ) (cid:17) ≥ η x ′ ,y ′ (cid:16)b x st +1 Q s⋆ y ′ s − x ′ s Q s⋆ b y st +1 (cid:17) − η Γ t − η − γ ) (cid:0) k b x st +1 − x st k + k b y st +1 − y st k (cid:1) ( k Q s⋆ k ≤ − γ ) ≥ η st +1 − η Γ t − η − γ ) k b z st +1 − z st k (by the definition of ∆ st +1 ) ≥ η st +1 − η Γ t − k b z st +1 − z st k . (25)Then notice that we have θ st +1 + η Γ t + 2 η ε ≥ k b z st +1 − z st k + k z st − b z st k + η Γ t η ε (by the definition of θ st +1 and that η Γ t ≤ η − γ ≤ ) ≥ (cid:18) k b z st +1 − z st k + k z st − b z st k + η Γ t √ ηε (cid:19) (Cauchy-Schwarz inequality) ≥ η (cid:0) ∆ st +1 (cid:1) . (by Eq. (25) and notice that ∆ st +1 ≥ ) Lemma 26 (Key Lemma for Average Duality-gap Bounds)
For all t ≥ , we have dist ⋆ ( b z st +1 ) ≤ dist ⋆ ( b z st ) − θ st +1 + θ st − η
128 (∆ st +1 ) + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Proof
Combining Lemma 25 with Lemma 24, we get dist ⋆ ( b z st +1 ) ≤ dist ⋆ ( b z st ) − θ st +1 − θ st +1 + θ st + 4 η Γ t + 8 η k Q st − Q st − k + 6 ηε ≤ dist ⋆ ( b z st ) − θ st +1 − (cid:18) η (cid:0) ∆ st +1 (cid:1) − η Γ t − η ε (cid:19) + θ st + 4 η Γ t + 8 η k Q st − Q st − k + 6 ηε ≤ dist ⋆ ( b z st ) − θ st +1 + θ st − η (cid:0) ∆ st +1 (cid:1) + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε. ( ηε ≤ ) Lemma 27 (Key Lemma for Point-wise Convergence Bounds)
There exists a constant C ′ > (which depends on the transition and the loss/payoff functions) such that for all t ≥ , dist ⋆ ( b z st +1 ) + 4 . θ st +1 ≤
11 + η C ′ (dist ⋆ ( b z st ) + 4 . θ st ) + 5 η Γ t + 8 η k Q st − Q st − k − θ st + 7 ηε. Proof
By Theorem 5 of (Wei et al., 2021) or Lemma 3 of (Gilpin et al., 2012), we have ∆ st +1 ≥ C dist ⋆ ( b z st +1 ) for some problem-dependent constant < C ≤ − γ ( C depends on { Q s⋆ } s ). Thus Theorem 26implies dist ⋆ ( b z st +1 ) + 5 θ st +1 ≤ dist ⋆ ( b z st ) + θ st − η C
128 dist ⋆ ( b z st +1 ) + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε. By defining C ′ = C , we further get dist ⋆ ( b z st +1 ) + 51 + η C ′ θ st +1 ≤
11 + η C ′ (cid:0) dist ⋆ ( b z st ) + θ st + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε (cid:1) ≤
11 + η C ′ (dist ⋆ ( b z st ) + θ st ) + 5 η Γ t + 8 η k Q st − Q st − k + 7 ηε. Notice that η C ′ ≥ × ≥ . . Thus we further have dist ⋆ ( b z st +1 ) + 4 . θ st +1 ≤
11 + η C ′ (dist ⋆ ( b z st ) + 4 . θ st ) + 5 η Γ t + 8 η k Q st − Q st − k − θ st + 7 ηε where in the last inequality we use η C ′ ≤ . η C ′ − because η C ′ ≤ × . Appendix D. Proof for Step 3: Bounding k Q st − Q st − k Lemma 28
We have for t ≥ and all s ∈ S , k Q st − Q st − k ≤ γ (1 − γ ) J t − + 2 γ γ K t − + 16 γ ε − γ . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Proof
It is equivalent to prove that for all t ≥ , k Q st +1 − Q st k ≤ γ (1 − γ ) J t + 2 γ γ K t + 16 γ ε − γ . By definition, Q st ( a, b ) = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t − i , we have k Q st +1 − Q st k = max a,b ( Q st +1 ( a, b ) − Q st ( a, b )) ≤ γ max s ′ (cid:16) V s ′ t − V s ′ t − (cid:17) (26)Now it suffices to upper bound (cid:0) V st − V st − (cid:1) for any s . By Corollary 22, we have V st − = P t − τ =1 α τt − ρ sτ .Therefore, V st − V st − = α t (cid:0) ρ st − V st − (cid:1) = α t ρ st − t − X τ =0 α τt − ρ sτ ! = α t t − X τ =0 α τt − ( ρ st − ρ sτ ) ! (because P t − τ =0 α τt − = 1 )In the following calculation, we omit the superscript s for simplicity. By defining diff h , | ρ h − ρ h − | , we have ( V t − V t − ) ≤ ( α t ) t − X τ =0 α τt − ( ρ t − ρ τ ) ! ≤ ( α t ) t − X τ =0 α τt − t X h = τ +1 ( ρ h − ρ h − ) ! ≤ ( α t ) t − X τ =0 α τt − t X h = τ +1 diff h ! = ( α t ) t X h =1 h − X τ =0 α τt − diff h ! ≤ ( α t ) t X h =1 δ h − t − diff h ! . (by Lemma 35)Then we continue: ( V t − V t − ) ≤ ( α t ) t X h =1 δ h − t − diff h ! AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES ≤ ( α t ) t X h =1 δ h − t − ! t X h =1 δ h − t − diff h ! (Cauchy-Schwarz inequality) ≤ t X h =1 α h δ h − t − ! t X h =1 α h δ h − t − diff h ! ( α t ≤ α h for h ≤ t ) ≤ t X τ =1 α τt diff τ (note that α h δ h − t − = α h Q t − τ = h (1 − α τ ) ≤ α h Q tτ = h +1 (1 − α τ ) = α ht ) = t X τ =1 α τt (cid:16) ρ t − x τ Q τ y τ + x τ ( Q τ − Q τ − ) y τ + ( x τ − x τ − ) Q τ − y τ + x τ − Q τ − ( y τ − y τ − ) + x τ − Q τ − y τ − − ρ t − (cid:17) ≤ t X τ =1 α τt (cid:18) ε − γ + 21 + γ k Q τ − Q τ − k + 8(1 − γ ) k x τ − x τ − k + 8(1 − γ ) k y τ − y τ − k + 8 ε − γ (cid:19) , where we use ( a + b + c + d + e ) ≤ − γ a + γ b + − γ c + − γ d + − γ e which is due toCauchy-Schwarz inequality. By Lemma 21 and the definitions of J st , K st , J t , K t in Definition 13and Definition 14, t X τ =1 α τt k Q sτ − Q sτ − k = K st ≤ K t , t X τ =1 α τt k z sτ − z sτ − k ≤ J st ≤ J t . Combining them with the previous upper bound for ( V st − V st − ) , we get ( V st − V st − ) ≤ − γ ) J t + 21 + γ K t + 16 ε − γ for all s . Further combining this with Eq. (26), we get k Q st +1 − Q st k ≤ γ (1 − γ ) J t + 2 γ γ K t + 16 γ ε − γ . Appendix E. Proof for Steps 4 and 5: Bounding k Q st − Q s⋆ k Lemma 29
For all t ≥ , Γ t ≤ γ t − X τ =1 α τt − Γ τ + Reg t − + ε ! . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Proof
We proceed with Q st +1 ( a, b )= σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t i = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt ρ s ′ τ (Corollary 22) ≤ σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt x s ′ τ Q s ′ τ y s ′ τ + ε (by the definition of ρ s ′ τ and that P tτ =1 α τt = 1 for t ≥ ) ≤ σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt b x s ′ t⋆ Q s ′ τ y s ′ τ + Reg t + ε (by the definition of Reg t ) ≤ σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt b x s ′ t⋆ Q s ′ ⋆ y s ′ τ + t X τ =1 α τt Γ τ + Reg t + ε (by the definition of Γ τ ) ≤ σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) " t X τ =1 α τt b x s ′ t⋆ Q s ′ ⋆ y s ′ ⋆ + γ t X τ =1 α τt Γ τ + Reg t + ε ! (by definition of y s ′ ⋆ ) = σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ ⋆ i + γ t X τ =1 α τt Γ τ + Reg t + ε ! ( P tτ =1 α τt = 1 for t ≥ ) = Q s⋆ ( a, b ) + γ t X τ =1 α τt Γ τ + Reg t + ε ! . Similarly, Q st +1 ( a, b ) ≥ Q s⋆ ( a, b ) − γ t X τ =1 α τt Γ τ + Reg t + ε ! . They jointly imply Γ t +1 ≤ γ t X τ =1 α τt Γ τ + Reg t + ε ! . Lemma 30
For any state s and time t ≥ , Reg t ≤ η Z t + 4 η (1 − γ ) J t + 4 ηK t + 3 ε. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Proof
Summing the first bound in Lemma 23 over τ = 1 , . . . , t with weights α τt , and droppingnegative terms −k b x st +1 − x st k − k x st − b x t k , we get t X τ =1 α τt ( x sτ − b x sτ⋆ ) Q sτ y sτ ≤ t X τ =1 α τt η (cid:0) dist ⋆ ( b x sτ ) − dist ⋆ ( b x sτ +1 ) (cid:1) + 4 η (1 − γ ) t X τ =1 α τt k y sτ − y sτ − k + 4 η t X τ =1 α τt k Q sτ − Q sτ − k + 3 ε ≤ α t η dist ⋆ ( b x s ) + t X τ =2 α τt − α τ − t η dist ⋆ ( b x sτ ) + 4 η (1 − γ ) t X τ =1 α τt k y sτ − y sτ − k + 4 η t X τ =1 α τt k Q sτ − Q sτ − k + 3 ε ≤ α t η dist ⋆ ( b x s ) + t X τ =2 α τt − α τ − t η dist ⋆ ( b x sτ ) + 4 η (1 − γ ) J st + 4 ηK st + 3 ε. (27)Observe that by definition, we have for τ ≥ , α τt − α τ − t = α τt (cid:18) − α τ − (1 − α τ ) α τ (cid:19) = α τt × α τ − α τ − + α τ − α τ α τ ≤ α τ − α τt where in the inequality we use α τ ≤ α τ − . Using this in Eq. (27), we get t X τ =1 α τt ( x sτ − b x sτ ∗ ) Q sτ y sτ ≤ α t η dist ⋆ ( b x s ) + 12 η t X τ =2 α τt α τ − dist ⋆ ( b x sτ ) + 4 η (1 − γ ) J st + 4 ηK st + 3 ε = 12 η t X τ =1 α τt α τ − dist ⋆ ( b x sτ ) + 4 η (1 − γ ) J st + 4 ηK st + 3 ε (recall that α = 1 )Using J sn ≤ J n , K sn ≤ K n , and the definition of Z t , Reg t finishes the proof. Appendix F. Combining Lemmas to Show Last-iterate Convergence
In this section, we provide proofs for Theorem 1 and Theorem 2. To achieve so, we first proveLemma 31 by combining the results in Appendix D and Appendix E. Then we combine Theorem 26,Lemma 27, and Lemma 31 to prove Theorem 1 and Theorem 2.
Lemma 31
For any s and t ≥ , η Γ t + 8 η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) ≤ max s ′ C η (1 − γ ) t − X τ =1 β τt ( θ s ′ τ + θ s ′ τ +1 ) + C t − X τ =1 β τt α τ − dist ⋆ ( b z s ′ τ ) ! + 80 β t + 80 ηε (1 − γ ) . where C = 1152 × and C = 10 . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Proof
By Lemma 28, for all t ≥ , η k Q st − Q st − k ≤ η γ (1 − γ ) J t − + 2 η γ γ K t − + 16 η ε − γ . (28)By Lemma 29 and Lemma 30, for all t ≥ , η Γ t ≤ γ t − X τ =1 α τt − η Γ τ + 4 η (1 − γ ) J t − + 4 η K t − + 12 Z t − + 4 ηε. (29)Now, multiply Eq. (29) with − γ , and then add it to Eq. (28). Then we get that for t ≥ , η k Q st − Q st − k + 1 − γ η Γ t ≤ γ t − X τ =1 α τt − (cid:18) − γ η Γ τ (cid:19) + (cid:18) γ (1 − γ ) + 14(1 − γ ) (cid:19) η J t − + (cid:18) γ γ + 1 − γ (cid:19) η K t − + (1 − γ ) Z t −
32 + ηε ≤ γ t − X τ =1 α τt − (cid:18) − γ η Γ τ (cid:19) + 9(1 − γ ) η J t − + γη K t − + (1 − γ ) Z t −
32 + ηε (see explanation below) ≤ γ t − X τ =1 α τt − (cid:18) − γ η Γ τ + η max s ′ k Q s ′ τ − Q s ′ τ − k (cid:19) + 9(1 − γ ) η J t − + (1 − γ ) Z t −
32 + ηε , where in the second inequality we use that γ γ + − γ − γ = (1 − γ ) (cid:16) − γ γ (cid:17) ≤ since γ ≥ ,and in the last inequality, we use K st − = P t − τ =1 α τt − k Q sτ − Q sτ − k . Define the new variable u t = η max s k Q st − Q st − k + 1 − γ η Γ t . Then the above implies that for all t ≥ , u t ≤ γ t − X τ =1 α τt − u τ + 9(1 − γ ) η J t − + (1 − γ ) Z t −
32 + ηε . (30)Observe that Eq. (30) is in the form of Lemma 33 with the following choices: g t = u t , ∀ t ≥ ,h t = ( u t + ηε for t = 1 η (1 − γ ) J t − + (1 − γ )32 Z t − + ηε for t ≥ and get that for t ≥ , u t ≤ η (1 − γ ) t X τ =2 β τt J τ − + 1 − γ t X τ =2 β τt Z τ − + β t u + ηε t X τ =1 β τt ≤ η (1 − γ ) t X τ =2 β τt J τ − + 1 − γ t X τ =2 β τt Z τ − + (1 − γ ) β t + ηε − γ (by Lemma 38) AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES because u t ≤ η (1 − γ ) + − γ ≤ − γ . Further using Lemma 34 on the first two terms on the right-handside, and noticing that γ ≤ , we further get that for t ≥ , u t ≤ max s η (1 − γ ) t − X τ =1 β τt (cid:13)(cid:13) z sτ − z sτ − (cid:13)(cid:13) + 1 − γ t − X τ =1 β τt α τ − dist ⋆ ( b z sτ ) + (1 − γ ) β t + ηε − γ ≤ max s η (1 − γ ) t − X τ =1 β τt ( k z sτ − b z sτ k + k b z sτ − z sτ − k )+1 − γ t − X τ =1 β τt α τ − dist ⋆ ( b z sτ ) + (1 − γ ) β t + ηε − γ ≤ max s η (1 − γ ) t − X τ =1 β τt ( θ sτ +1 + θ sτ ) + 1 − γ t − X τ =1 β τt α τ − dist ⋆ ( b z sτ ) + (1 − γ ) β t + ηε − γ . (31)Finally, notice that according to the definition of u t , we have η Γ t + 8 η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) ≤ − γ u t .Combining Eq. (31), we finish the proof for case for t ≥ . The case for t = 1 is trivial since η Γ t + 8 η (cid:13)(cid:13) Q st − Q st − (cid:13)(cid:13) ≤ ≤
80 = 80 β . Proof of Theorem 1.
Define C α ( T ) , P Tt =1 α t and let C β be an upper bound of P ∞ t = τ β τt for any τ . With the choice of α t specified in the theorem, we have C α ( T ) = 1 + P Tt =1 H +1 H + t = O ( H log T ) = O (cid:16) log T − γ (cid:17) . By Lemma 40, we have C β ≤ − γ + 3 . Define S = |S| .Combining Lemma 31 and Theorem 26, we get that for t ≥ , η
128 (∆ st +1 ) ≤ dist ⋆ ( b z st ) − dist ⋆ ( b z st +1 ) − θ st +1 + θ st + max s ′ C η (1 − γ ) t − X τ =1 β τt ( θ s ′ τ + θ s ′ τ +1 ) + C t − X τ =1 β τt α τ − dist ⋆ ( b z s ′ τ ) ! + 80 β t + 87 ηε (1 − γ ) . Summing the above over s ∈ S and t ∈ [ T − , and denoting Θ t = P s θ st , we get η T X t =1 X s (∆ st ) ≤ O ( S ) − T X t =1 t + C Sη (1 − γ ) T X t =1 t − X τ =1 β τt (Θ τ + Θ τ +1 )+ O S T X t =1 t − X τ =1 β τt α τ − + S T X t =1 β t + SηεT (1 − γ ) ! (32)since dist ⋆ ( b z sτ ) = O (1) and Θ = O ( S ) . Notice that the following hold T X t =1 t − X τ =1 β τt (Θ τ + Θ τ +1 ) ≤ T − X τ =1 T X t = τ β τt (Θ τ + Θ τ +1 ) ≤ C β T X τ =1 Θ τ , T X t =1 t − X τ =1 β τt α τ − ≤ T X τ =1 T X t = τ β τt α τ − ≤ C β T X τ =1 α τ − = C α ( T ) C β , AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES and P Tt =1 β t ≤ C β . Combining these three inequalities with Eq. (32), we get T X t =1 X s (∆ st ) = 128 η T X t =1 (cid:18) − C β C Sη (1 − γ ) (cid:19) Θ t + O (cid:18) SC α ( T ) C β η + SεTη (1 − γ ) (cid:19) = O (cid:18) SC α ( T ) C β η + SεTη (1 − γ ) (cid:19) . (by our choice of η , we have C β C Sη (1 − γ ) ≤ C Sη (1 − γ ) ≤ )By Cauchy-Schwarz inequality, we further have T X t =1 X s ∆ st ≤ √ ST T X t =1 X s (∆ st ) ! = O S p C α ( T ) C β Tη + ST √ ǫ √ η (1 − γ ) ! . Finally, by Lemma 32, we get T T X t =1 max s,x ′ ,y ′ (cid:16) V s b x t ,y ′ − V sx ′ , b y t (cid:17) ≤ − γ T T X t =1 max s ∆ st = O S p C α ( T ) C β η (1 − γ ) √ T + S √ ǫ √ η (1 − γ ) ! = O (cid:18) S √ log Tη (1 − γ ) √ T + S √ ǫ √ η (1 − γ ) (cid:19) . Proof of Theorem 2.
Combining Lemma 27 and Lemma 31, we get that for all t ≥ , dist ⋆ ( b z st +1 ) + 4 . θ st +1 ≤
11 + η C ′ (cid:0) dist ⋆ ( b z st ) + 4 . θ st (cid:1) + max s ′ C η (1 − γ ) t − X τ =1 β τt ( θ s ′ τ + θ s ′ τ +1 ) + C t − X τ =1 β τt α τ − dist ⋆ ( b z s ′ τ ) ! + 80 β t − θ st + 87 ηε (1 − γ ) . Summing the above inequality over s ∈ S , and denoting L t = P s dist ⋆ ( b z st ) , Θ t = P s θ st , we getthat for all t , L t +1 + 4 . t +1 ≤
11 + η C ′ ( L t + 4 . t ) + C Sη (1 − γ ) t − X τ =1 β τt (Θ τ + Θ τ +1 )+ C S t − X τ =1 β τt α τ − L τ + 80 Sβ t − t + 87 Sηε (1 − γ ) . (33)The key idea of the following analysis is to use the negative (bonus) term − t to cancelthe positive (penalty) term C Sη (1 − γ ) P tτ =1 β τt Θ τ . Since the time indices do not match, we perform smoothing over time to help. Consider the following weighted sum of L τ + 4 . τ with weights λ τt +1 : t +1 X τ =2 λ τt +1 ( L τ + 4 . τ ) AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES = t X τ =1 λ τ +1 t +1 ( L τ +1 + 4 . τ +1 ) (re-indexing) ≤ t X τ =1 λ τt ( L τ +1 + 4 . τ +1 ) (Lemma 37) ≤
11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + C Sη (1 − γ ) t X τ =1 λ τt τ − X i =1 β iτ (Θ i + Θ i +1 )+ C S t X τ =1 λ τt τ − X i =1 β iτ α i − L i + 80 S t X τ =1 λ τt β τ − t X τ =1 λ τt Θ τ + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤
11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + C Sη (1 − γ ) t − X i =1 t X τ = i λ τt β iτ ! (Θ i + Θ i +1 )+ C S t − X i =1 t X τ = i λ τt β iτ ! α i − L i + 80 S t X τ =1 λ τt β τ − t X τ =1 λ τt Θ τ + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤
11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 3 C Sη (1 − γ ) t − X τ =1 λ τt (Θ τ + Θ τ +1 )+ 3 C S − γ t − X τ =1 λ τt α τ − L τ + 240 S − γ λ t − t X τ =1 λ τt Θ τ + 87 Sηε (1 − γ ) t X τ =1 λ τt (by Lemma 36) ≤
11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 6 C Sη (1 − γ ) t X τ =1 λ τt Θ τ + 3 C S − γ t − X τ =1 λ τt α τ − L τ + 240 S − γ λ t − t X τ =1 λ τt Θ τ + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤
11 + η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 3 C S − γ t − X τ =1 λ τt α τ − L τ + 240 S − γ λ t + 87 Sηε (1 − γ ) t X τ =1 λ τt (by our choice of η , C Sη (1 − γ ) ≤ )where in the second-to-last inequality we use Lemma 41: with the special choice of α t specified inthe theorem, we have λ τt = α t ≤ λ τ +1 t for τ ≤ t − .Let t = min n τ : C S − γ α τ ≤ η C ′ o . Then we have t +1 X τ =2 λ τt +1 ( L τ + 4 . τ ) ≤ (cid:18)
11 + η C ′ + η C ′ (cid:19) t X τ =1 λ τt ( L τ + 4 . τ ) + 3 C S − γ min { t ,t } X τ =1 λ τt α τ − L τ + 240 S − γ λ t + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤
11 + 0 . η C ′ t X τ =3 λ τt ( L τ + 4 . τ ) + 12 C S − γ min { t ,t } X τ =1 λ τt α τ − + 240 S − γ λ t + 87 Sηε (1 − γ ) t X τ =1 λ τt . ( ηC ′ ≤ − according to Lemma 27, L τ ≤ S · max z,z ′ k z − z ′ k ≤ S ) AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Finally, we add λ t +1 ( L + 4 . ) to both sides, and note that λ t +1 ( L + 4 . ) = α t +1 ( L + 4 . ) ≤ α t ( L + 4 . ) ≤ α t · S = 22 Sλ t , where the first and second equality is by Lemma 41. Then we get t +1 X τ =1 λ τt +1 ( L τ + 4 . τ ) ≤
11 + 0 . η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 12 C S − γ min { t ,t } X τ =1 λ τt α τ − + 240 S − γ λ t + 22 Sλ t + 87 Sηε (1 − γ ) t X τ =1 λ τt ≤
11 + 0 . η C ′ t X τ =1 λ τt ( L τ + 4 . τ ) + 274 C S − γ min { t ,t } X τ =1 λ τt α τ − + 87 Sηε (1 − γ ) t X τ =1 λ τt . Define Y t , t X τ =1 λ τt ( L τ + 4 . τ ) . Then we can further write that for t ≥ , Y t +1 ≤
11 + 0 . η C ′ Y t + 274 C S t − γ λ min { t ,t } t + 87 Sηε (1 − γ ) t X τ =1 λ τt . (upper bounding α τ − by )Applying Lemma 39 with c = . η C ′ . η C ′ , g t = Y t +1 , h t = C S t − γ λ min { t ,t } t + Sηε (1 − γ ) P tτ =1 λ τt ,we get Y t ≤ Y (1 + 0 . η C ′ ) − t + 20 . η C ′ C S t − γ + 87 Sηε (1 − γ ) sup t ′ ∈ [1 , t ] t ′ X τ =1 λ τt ! (1 + 0 . η C ′ ) − t + 20 . η C ′ C S t − γ sup t ′ ∈ [ t ,t ] λ min { t ,t ′ } t ′ + 87 Sηε (1 − γ ) sup t ′ ∈ [ t ,t ] t ′ X τ =1 λ τt ! . ( λ τt ≤ )With the choice of α t = H +1 H + t where H = − γ , we have t = Θ (cid:18) C S (1 − γ ) η C ′ ( H + 1) (cid:19) = Θ (cid:18) S (1 − γ ) η C ′ (cid:19) sup t ′ ∈ [1 ,t ] t ′ X τ =1 λ τt ≤ sup t ′ ∈ [1 ,t ] t ′ α t ≤ t ( H + 1) H + t ≤ H + 1 = O (cid:18) − γ (cid:19) (Lemma 41) sup t ′ ∈ [ t ,t ] λ min { t ,t ′ } t ′ = ( if t ≤ t α t else (Lemma 41) AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Combining them and noticing that (1 + 0 . η C ′ ) − t = O ( t ) when t ≥ η C ′ , we get that for t ≥ t = Θ (cid:16) S (1 − γ ) η C ′ (cid:17) , Y t = O (cid:18) S η C ′ (1 − γ ) α t + SεηC ′ (1 − γ ) (cid:19) = O (cid:18) S η C ′ (1 − γ ) t + SεηC ′ (1 − γ ) (cid:19) . Since Y t ≤ S + 22 S ( t − α t ≤ O ( S log t − γ ) , the above bound also trivially holds for t ≤ t .Then noticing that L t ≤ Y t finishes the proof. Lemma 32
For any policy pair x, y , the duality gap on the game can be related to the duality gapon individual states as follows: max s,x ′ ,y ′ (cid:0) V sx,y ′ − V sx ′ ,y (cid:1) ≤ − γ max s,x ′ ,y ′ (cid:0) x s Q s⋆ y ′ s − x ′ s Q s⋆ y s (cid:1) . Proof
Notice that for any policy x and state s , max y ′ V sx,y ′ − V s⋆ = X a,b x s ( a ) y ′ s ( b ) Q sx,y ′ ( a, b ) − X a,b x s⋆ ( a ) y s⋆ ( b ) Q s⋆ ( a, b )= X a,b x s ( a ) y ′ s ( b ) (cid:0) Q sx,y ′ ( a, b ) − Q s⋆ ( a, b ) (cid:1) + X a,b (cid:0) x s ( a ) y ′ s ( b ) − x s⋆ ( a ) y s⋆ ( b ) (cid:1) Q s⋆ ( a, b )= γ X a,b x s ( a ) y ′ s ( b ) p ( s ′ | s, a, b ) (cid:16) V s ′ x,y ′ − V s ′ ⋆ (cid:17) + x s Q s⋆ y ′ s − x s⋆ Q s⋆ y s⋆ ≤ γ max s ′ ,y ′ (cid:16) V s ′ x,y ′ − V s ′ ⋆ (cid:17) + x s Q s⋆ y ′ s − x s⋆ Q s⋆ y s⋆ . Taking max over s on two sides and rearranging, we get max s,y ′ (cid:0) V sx,y ′ − V s⋆ (cid:1) ≤ − γ max s,y ′ (cid:0) x s Q s⋆ y ′ s − x s⋆ Q s⋆ y s⋆ (cid:1) ≤ − γ max s,x ′ ,y ′ (cid:0) x s Q s⋆ y ′ s − x ′ s Q s⋆ y s (cid:1) . Similarly, max s,x ′ (cid:0) V s⋆ − V sx ′ ,y (cid:1) ≤ − γ max s,x ′ ,y ′ (cid:0) x s Q s⋆ y ′ s − x ′ s Q s⋆ y s (cid:1) . Combining the two inequalities, we get max s,x ′ ,y ′ (cid:0) V sx,y ′ − V sx ′ ,y (cid:1) ≤ − γ max s,x ′ ,y ′ (cid:0) x s Q s⋆ y ′ s − x ′ s Q s⋆ y s (cid:1) . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Appendix G. Auxiliary Lemmas
G.1. Interactions between the Auxiliary CoefficientsLemma 33
Let { g t } t =1 , ,... , { h t } t =1 , ,... be non-negative sequences that satisfy g t ≤ γ P t − τ =1 α τt − g τ + h t for all t ≥ . Then g t ≤ P tτ =1 β τt h τ . Proof
We prove it by induction. When t = 1 , the condition guarantees g ≤ h = β h . Supposethat it holds for , . . . , t − . Then g t ≤ γ t − X τ =1 α τt − g τ + h t ≤ γ t − X τ =1 α τt − τ X i =1 β iτ h i ! + h t = t − X i =1 t − X τ = i γα τt − β iτ ! h i + h t It remains to prove that P t − τ = i γα τt − β iτ ≤ β it for all i ≤ t − . We use another induction to showthis. Fix i and t , and define the partial sum ζ r = P rτ = i γα τt − β iτ for r ∈ [ i, t − . Below we showthat ζ r ≤ α i r Y τ = i (1 − α τ + α τ γ ) t − Y τ = r +1 (1 − α τ ) . (34)Notice that the right-hand side above is β it when r = t − , which is exactly what we want to prove.When r = i , ζ r = γα it − = γα i Q t − τ = i +1 (1 − α τ ) ≤ α i (1 − α i + α i γ ) Q t − τ = i +1 (1 − α τ ) wherethe inequality is because − α i + α i γ − γ = (1 − α i )(1 − γ ) ≥ . Now assume that Eq. (34) holdsup to r for some r ≥ i . Then ζ r +1 = ζ r + β ir +1 γα r +1 t − ≤ α i r Y τ = i (1 − α τ + α τ γ ) t − Y τ = r +1 (1 − α τ ) + α i r Y τ = i (1 − α τ + α τ γ ) ! γα r +1 t − Y τ = r +2 (1 − α τ )= α i r +1 Y τ = i (1 − α τ + α τ γ ) t − Y τ = r +2 (1 − α τ ) . This finishes the induction.
Lemma 34
Let { h t } t =1 , ,... and { k t } t =1 , ,... be non-negative sequences that satisfy h t = P tτ =1 α τt k τ .Then P tτ =2 β τt h τ − ≤ γ P t − τ =1 β τt k τ . Proof
By the assumption on h τ , we have t X τ =2 β τt h τ − ≤ t X τ =2 β τt τ − X i =1 α iτ − k i ! = t − X i =1 t X τ = i +1 β τt α iτ − ! k i AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
It remains to prove that for i < t , P tτ = i +1 β τt α iτ − ≤ γ β it , or equivalently, γ P tτ = i +1 β τt α iτ − ≤ γ β it . Below we use another induction to prove this. Fix i and t , and define the partial sum ζ r = γ P tτ = r β τt α iτ − for r ∈ [ i + 1 , t ] . We will show that ζ r ≤ α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r (1 − α τ + α τ γ ) . (35)For the base case r = t , we have ζ r = γβ tt α it − = γα i t − Y τ = i +1 (1 − α τ ) ≤ α i t − Y τ = i +1 (1 − α τ ) . Suppose that Eq. (35) holds up to r for some r ≤ t . Then ζ r − = ζ r + α ir − γβ r − t ≤ α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r (1 − α τ + α τ γ ) + α i r − Y τ = i +1 (1 − α τ ) ! γα r − t − Y τ = r − (1 − α τ + α τ γ ) ≤ α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r (1 − α τ + α τ γ ) ! × (1 − α r − + γα r − (1 − α r − + α r − γ )) ≤ α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r (1 − α τ + α τ γ ) ! × (1 − α r − + α r − γ )= α i r − Y τ = i +1 (1 − α τ ) t − Y τ = r − (1 − α τ + α τ γ ) . This finishes the induction. Applying the result with r = i + 1 , we get ζ i +1 ≤ α i t − Y τ = i +1 (1 − α τ + α τ γ ) = β it − α i + α i γ ≤ β it γ where the last inequality is by − α i + α i γ − γ = (1 − α i )(1 − γ ) ≥ . This finishes the proof. Lemma 35
For ≤ h ≤ t , P hτ =0 α τt = δ ht . Proof
We prove it by induction on h . When h = 0 , P hτ =0 α τt = α t = Q tτ =1 (1 − α τ ) = δ ht since α = 1 . Suppose that the formula holds for h . Then P h +1 τ =0 α τt = P hτ =0 α τt + α h +1 t = Q tτ = h +1 (1 − α τ )+ α h +1 Q tτ = h +2 (1 − α τ ) = Q tτ = h +2 (1 − α τ ) = δ h +1 t , which finishes the induction. Lemma 36
For any positive integers i, t with i ≤ t , P tτ = i λ τt β iτ ≤ − γ λ it . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Proof
Notice that t X τ = i λ τt β iτ = λ it + t X τ = i +1 λ τt β iτ . (36)Below we use induction to prove t X τ = r λ τt β iτ ≤ − γ α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ for r ∈ [ i + 1 , t ] . When r = t , P tτ = r λ τt β iτ = λ tt β it = β it = α i Q t − τ = i (1 − α τ (1 − γ )) . Suppose thisholds for some r ≤ t . Then t X τ = r − λ τt β iτ ≤ − γ α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ + β ir − λ r − t ≤ − γ α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ + α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17)! α r − t − Y τ = r − λ τ ≤ " α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ − γ (1 − α r − (1 − γ )) + α r − (cid:19) ( λ r − ≤ ) = " α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r λ τ − γ (cid:18) − α r − (1 − γ ) (cid:19)(cid:19) ≤ − γ α i r − Y τ = i (cid:16) − α τ (1 − γ ) (cid:17) t − Y τ = r − λ τ , ( λ r − ≥ − α r − (1 − γ ) by the definition of λ r − )which finishes the induction. Notice that this implies t X τ = i +1 λ τt β iτ ≤ − γ α i (1 − α i (1 − γ )) t − Y τ = i +1 λ τ ≤ − γ α i t − Y τ = i λ τ = 21 − γ λ it where the second inequality is by the definition of λ i . Thus, t X τ = i λ τt β iτ ≤ − γ λ it . (37) Lemma 37 λ τ +1 t +1 ≤ λ τt . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Proof
When τ < t , we have λ τ +1 t +1 λ τt = α τ +1 Π ti = τ +1 λ i α τ Π t − i = τ λ i ≤ α τ +1 α τ λ τ ≤ where in the first inequality we use λ t ≤ and in the second inequality we use the definition of λ τ .When τ = t , we have λ τ +1 t +1 λ τt = = 1 . Lemma 38 P tτ =1 β τt ≤ − γ . Proof
Below we use induction to prove that for all r = 1 , , . . . , t − , r X τ =1 β τt ≤ − γ t − Y i = r (1 − α i + α i γ ) . When r = 1 , the left-hand side is β t = α Q t − i =1 (1 − α i + α i γ ) ≤ − γ Q t − i =1 (1 − α i + α i γ ) , whichis the right-hand side.Suppose that this holds for r , then r +1 X τ =1 β τt = β r +1 t + r X τ =1 β τt ≤ α r +1 t − Y i = r +1 (1 − α i + α i γ ) + 11 − γ t − Y i = r (1 − α i + α i γ ) ≤ − γ t − Y i = r +1 (1 − α i + α i γ ) ! ( α r +1 (1 − γ ) + 1 − α r (1 − γ )) ≤ − γ t − Y i = r +1 (1 − α i + α i γ ) (because α r +1 ≤ α r )which finishes the induction.Therefore, P tτ =1 β τt = 1 + P t − τ =1 β τt ≤ − γ ≤ − γ . Lemma 39
Let { g t } t =0 , , ,... , { h t } t =1 , ,... be non-negative sequences that satisfy g t ≤ (1 − c ) g t − + h t for some c ∈ (0 , for all t ≥ . Then g t ≤ g (1 − c ) t + max τ ∈ [1 , t / ] h τ c (1 − c ) t + max τ ∈ [ t / ,t ] h τ c . Proof
We first show that g t ≤ g (1 − c ) t + t X τ =1 (1 − c ) t − τ h τ . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
The case of t = 1 is clear. Suppose that this holds for g t . Then g t +1 ≤ (1 − c ) g (1 − c ) t + t X τ =1 (1 − c ) t − τ h τ ! + h t +1 = g (1 − c ) t +1 + t +1 X τ =1 (1 − c ) t +1 − τ h τ , which finishes the induction. Therefore, g t ≤ g (1 − c ) t + t / X τ =1 (1 − c ) t − τ max τ ∈ [1 , t / ] h τ + t X τ = t / +1 (1 − c ) t − τ max τ ∈ [ t / ,t ] h τ ≤ g (1 − c ) t + max τ ∈ [1 , t / ] h τ c (1 − c ) t + max τ ∈ [ t / ,t ] h τ c . G.2. Some Properties for the choice of α t = H +1 H + t Lemma 40
For the choice α t = H +1 H + t with H ≥ − γ , we have P ∞ t = τ β τt ≤ H + 3 . Proof
When t ≥ τ + 2 , β τt = α τ t − Y i = τ (1 − α i (1 − γ )) ≤ α τ t − Y i = τ (cid:18) − α i × H + 1 (cid:19) ( H + 1 ≥ − γ ) = α τ t − Y i = τ (cid:18) − H + i (cid:19) = H + 1 H + τ × H + τ − H + τ × H + τ − H + τ + 1 × · · · × H + t − H + t − H + 1 H + τ × ( H + τ − H + τ − H + t − H + t − H + 1 H + τ ( H + τ − H + τ − (cid:18) H + t − − H + t − (cid:19) ≤ ( H + 1)( H + τ − (cid:18) H + t − − H + t − (cid:19) . Therefore, ∞ X t = τ +2 β τt ≤ ( H + 1)( H + τ − × H + τ ≤ H + 1 , and thus P ∞ t = τ β τt ≤ H + 3 . AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Lemma 41
For the choice α t = H +1 H + t with H ≥ − γ , we have λ τt = α t for τ < t . Proof
With this choice of α t , λ t = max (cid:26) H + tH + t + 1 , − − γ × H + 1 H + t (cid:27) = max (cid:26) − H + t + 1 , − − γ × H + 1 H + t (cid:27) . By the condition, we have − γ × H +1 H + t ≥ H + t ≥ H + t +1 . Therefore, λ t = H + tH + t +1 = α t +1 α t . Thusfor τ < t , λ τt = α τ t − Y i = τ α i +1 α i = α t . Appendix H. Analysis on Sample Complexity
H.1. Proof of Theorem 3Proof
As long as we can make (cid:12)(cid:12)(cid:12) ℓ st ( a ) − e ⊤ a Q st e y st (cid:12)(cid:12)(cid:12) ≤ ε |A| (38)hold with high probability, then (cid:12)(cid:12) ℓ st ( a ) − e ⊤ a Q st y st (cid:12)(cid:12) ≤ (cid:12)(cid:12) ℓ st ( a ) − e ⊤ a Q st e y st (cid:12)(cid:12) + (cid:12)(cid:12) e ⊤ a Q st e y st − e ⊤ a Q st y st (cid:12)(cid:12) ≤ ε |A| + − γ × ε ′ |A| ≤ ε |A| , which implies k ℓ st − Q st y st k ≤ ε . We can similarly ensure k r st − Q s ⊤ t x st k ≤ ε and | ρ st − x s ⊤ t Q st y st | ≤ ε by the same way. Let N s,a , P Li =1 [ s i = s, a i = a ] be the number oftimes the ( s, a ) pair is visited. For a deterministic N , we can use Azuma-Hoeffding inequalityand know that Eq. (38) holds with probability − δ if N = Ω (cid:16) |A| ε log(1 /δ ) (cid:17) . However, N s,a israndom, so we cannot use Azuma-Hoeffding’s inequality directly. Let ( b (1) , s (1) ) , ( b (2) , s (2) ) , . . . bea sequence of independent random variables where b ( i ) ∼ e y st , s ( i ) ∼ p ( ·| s, a, b ( i ) ) , i = 1 , , . . . anddefine e ℓ st,m = m P mi =1 ( σ ( s, a, b ( i ) ) + γV s ( i ) t − ) . It is direct to see that e ℓ st,m is an unbiased estimatorof e ⊤ a Q st e y st . Then by Azuma-Hoeffding’s inequality, we haveProb "(cid:12)(cid:12)(cid:12) ℓ st ( a ) − e ⊤ a Q st e y st (cid:12)(cid:12)(cid:12) ≤ O s log( L/δ ) N s,a ! ≤ Prob " ∃ m ∈ [ L ] , (cid:12)(cid:12)(cid:12)e ℓ st,m − e ⊤ a Q st e y st (cid:12)(cid:12)(cid:12) ≤ O r log( L/δ ) m ! ≤ L X m =1 Prob "(cid:12)(cid:12)(cid:12)e ℓ st,m − e ⊤ a Q st e y st (cid:12)(cid:12)(cid:12) ≤ O r log( L/δ ) m ! ≤ δ. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Therefore, with probability at least − δ , Eq. (38) holds if N s,a = Ω (cid:18) |A| ε log( L/δ ) (cid:19) . (39)Now it remains to determine L to make Eq. (39) hold with high probability. Note that by Assumption 1,we know that T s ′ → s e x t , e y t ≤ µ for any s ′ . Let T s,a be the distribution of random variable which is thenumber of rounds between the current state-action pair ( s ′ , a ′ ) and the next occurrence of ( s, a ) under strategy e x st and e y st . The mean of this distribution is t s,a ≤ |A| ε ′ µ ≤ |A| ε ′ µ . Then by Markovinequality, Prob (cid:20) the number of rounds before reaching ( s, a ) ≤ |A| ε ′ µ (cid:21) ≥ . Therefore, with probability at least − δL , within Θ( | A | ε ′ µ log( L/δ )) rounds, we reach ( s, a ) state-action at least pair once. Thus, Eq. (39) holds when L = Ω (cid:16) |A| ε ′ µε log ( L/δ ) (cid:17) with probability − δ . Solving L gives L = e Ω (cid:16) |A| (1 − γ ) µε log (1 /δ ) (cid:17) . The cases for r st ( b ) and ρ st are similar.Finally, using a union bound on all A , B , S , and all iterations, we know that with probability − δ , the ε -approximations are always guaranteed if we use the estimation above and take L = e Ω (cid:16) |A| + |B| (1 − γ ) µε log ( T /δ ) (cid:17) . H.2. Proof of Corollary 4Proof
From Theorem 1, we know that in order to show T P Tt =1 max s,x ′ ,y ′ (cid:16) V s b x t ,y ′ − V sx ′ , b y t (cid:17) ≤ ξ ,it is sufficient to show |S| η (1 − γ ) q log TT ≤ ξ and |S|√ ε √ η (1 − γ ) ≤ ξ . Solving these two inequalities, weget T = Ω (cid:16) |S| η (1 − γ ) ξ log |S| η (1 − γ ) ξ (cid:17) and ε = O (cid:16) η (1 − γ ) ξ |S| (cid:17) . Plugging ε into L in Theorem 3 givesthe required L . H.3. Proof of Corollary 5Proof
From Theorem 2, we know that in order to show |S| P s ∈S dist ⋆ ( b z sT ) ≤ ξ , it is suffi-cient to show |S| η C (1 − γ ) T ≤ ξ and εηC (1 − γ ) ≤ ξ . Solving these two inequalities, we get T = Ω (cid:16) |S| η C (1 − γ ) ξ (cid:17) and ε = O (cid:0) ηC (1 − γ ) ξ (cid:1) . Plugging ε into L in Theorem 3 gives therequired L . Appendix I. Analysis on Rationalily
In this section, we analyze the rationality of our algorithm. First, we present the full pseudocodeof Algorithm 2, which is the single-player-perspective version of Algorithm 1, and then prove thatAlgorithm 2 achieves rationality. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Algorithm 2
Single-Player-Perspective Optimistic Gradient Descent/Ascent for Markov Games
Parameter : γ ∈ [ , , η ≤ q (1 − γ ) S , ε ∈ h , − γ i Parameters : a non-increasing sequence { α t } Tt =1 that goes to zero. Initialization : arbitrarily initialize b x s = x s ∈ ∆ A for all s ∈ S . V s ← for all s ∈ S . for t = 1 , . . . , T do For all s , define Q st ∈ R |A|×|B| as Q st ( a, b ) , σ ( s, a, b ) + γ E s ′ ∼ p ( ·| s,a,b ) h V s ′ t − i , and update b x st +1 = Π ∆ A nb x st − ηℓ st o , (40) x st +1 = Π ∆ A nb x st +1 − ηℓ st o , (41) V st = (1 − α t ) V st − + α t ρ st , (42)where ℓ st and ρ st are ε -approximations of Q st y s and x s ⊤ t Q st y s , respectively. endI.1. Single-Player-Perspective Version of Algorithm 1I.2. Analysis of Algorithm 2 In this section, we prove Theorem 6, which shows the rationality of Algorithm 2. We call theoriginal game
Game 1 and construct a two-player Markov game
Game 2 with the difference be-ing that Player 2 has only one single action (call it ) on each state, the loss function is rede-fined as σ ( s, a,
1) = E b ∼ y s [ σ ( s, a, b )] , and the transition kernel is redefined as p ( s ′ | s, a,
1) = E b ∼ y s [ p ( s ′ | s, a, b )] . Correspondingly, we define Q st ( a,
1) = σ ( s, a,
1) + γ E s ′ ∼ p ( ·| s,a, h V s ′ t − i , b x st +1 = Π ∆ A { b x st − ηℓ st } ,x st +1 = Π ∆ A (cid:8)b x st +1 − ηℓ st (cid:9) ,V st = (1 − α t ) V st − + α t ρ st , where V s = 0 for all s , and ℓ st and ρ st are the same as in Algorithm 2. Clearly, the sequences { b x t , x t } t ∈ [ T ] and { b x t , x t } t ∈ [ T ] are exactly the same (assuming their initializations are the same,that is, b x = b x and x = x ). In the following lemma, we show that ℓ st and ρ st are indeed ε -approximation of Q st ( · , and x s ⊤ t Q st ( a, · ) , which then implies that the sequence { b x t } t ∈ [ T ] isindeed the output of our Algorithm 1 for Game 2 (note that we can think of Player 2 executingAlgorithm 1 in
Game 2 as well since she only has one unique strategy). Realizing that X s⋆ for Game 2 is exactly the set of best responses of y s , we can thus conclude that Theorem 6 is a directcorollary of Theorem 1 and Theorem 2. Lemma 42
For all t and s , ℓ st and ρ st are ε -approximation of Q st ( · , and x s ⊤ t Q st ( a, · ) respectively. AST - ITERATE C ONVERGENCE OF
OGDA IN I NFINITE - HORIZON M ARKOV G AMES
Proof
We prove the result together with V st = V st , Q st ( · ,
1) = Q st y s for all t ∈ [ T ] by induction.When t = 1 , V st = V st clearly holds. In addition, Q s ( a, · ) y s = E b ∼ y s [ σ ( s, a, b )] = Q s ( a, .Therefore ℓ s and ρ s are indeed ε -approximation of Q s ( a, · ) and x s ⊤ Q st ( a, · ) .Suppose that the claim holds at t . By definition and the inductive assumption, we have Q st +1 ( a,
1) = σ ( s, a,
1) + γ E s ′ ∼ p ( ·| s,a, h V s ′ t i = E b ∼ y s h σ ( s, a, b ) + γ E s ′ ∼ p ( s ′ | s,a,b ) h V s ′ t ii = Q st +1 ( a, · ) y s , which also shows that ℓ st +1 and ρ st +1 are indeed ε -approximation of Q st +1 ( · , and x s ⊤ t +1 Q st ( a, · ) (recall x st +1 = x st +1 ). By definition of V st +1 , we also have V st +1 = (1 − α t +1 ) V st + α t ρ st =(1 − α t ) V st + α t ρ st = V st +1 , which finishes the induction., which finishes the induction.