Discounting the Past in Stochastic Games
Taylor Dohmen, University of Colorado, Boulder, USA
Ashutosh Trivedi, University of Colorado, Boulder, USA
Abstract
Stochastic games, introduced by Shapley, model adversarial interactions in stochastic environments where two players choose their moves to optimize a discounted-sum of rewards. In the traditional discounted reward semantics, long-term weights are geometrically attenuated based on the delay in their occurrence. We propose a temporally dual notion—called past-discounting—where agents have geometrically decaying memory of the rewards encountered during a play of the game. We study past-discounted weight sequences as rewards on stochastic game arenas and examine the corresponding stochastic games with discounted and mean payoff objectives. We dub these games forgetful discounted games and forgetful mean payoff games, respectively. We establish positional determinacy of these games and recover classical complexity results and a Tauberian theorem in the context of past-discounted reward sequences.
2012 ACM Subject Classification Computing methodologies → Stochastic games; Theory of computation → Stochastic control and optimization
Keywords and phrases
Stochastic Games, Discounted Reward, Past Discounting, Tauberian Theorems
1 Introduction

Two-player zero-sum stochastic games [25] provide a natural model for adversarial interactions between rational agents with competing objectives in uncertain settings. Beginning with an initial configuration of the arena, these games proceed in discrete time where at each step the players—named Min and Max—concurrently choose (stochastically) from a set of state-dependent actions. Based on their choices, a scalar weight is determined with the interpretation that this quantity represents a "payment" from Min to Max. Given the current state and the players' action pair, a probabilistic transition function determines the next state and the process repeats infinitely. The goal of Max is to maximize a given objective—typically specified as some functional into the real numbers—defined on infinite sequences of payments $(x_n)_{n \geq 0}$, while the goal of Min is the opposite. The most popular quantitative objectives are the family of discounted objectives, which are well-defined for discount factors $\lambda \in [0, 1)$, and the mean payoff objective:

$$\mathrm{Disc}(\lambda) \stackrel{\text{def}}{=} (x_n)_{n \geq 0} \mapsto \lim_{n \to \infty} \sum_{k=0}^{n} \lambda^k x_k \qquad \text{(Discounted Payoff)}$$

$$\mathrm{Mean} \stackrel{\text{def}}{=} (x_n)_{n \geq 0} \mapsto \liminf_{n \to \infty} \frac{1}{n+1} \sum_{k=0}^{n} x_k. \qquad \text{(Mean Payoff)}$$

The scalar $\lambda \in [0, 1)$ in a discounted objective can be interpreted as the complement of a probability $(1 - \lambda)$ that the game will halt at any given point in time. At stage $k$, the quantity transferred to Max from Min is scaled by $\lambda^k$: the probability that the game continues. The players are interested in optimizing, in expectation, the total accumulated reward before termination. These semantics are practically appealing as they provide mathematical justification for the notion that immediate rewards are more valuable than potential future rewards, and form the backbone of topics like reinforcement learning [26]. For an infinite sequence of scalars $(x_n)_{n \geq 0}$, the discounted sum of the sequence $(1 - \lambda) \lim_{n \to \infty} \sum_{k=0}^{n} \lambda^k x_k$ may be viewed as a weighted average—a convex combination, in fact—in which elements occurring earlier in the sequence are weighted more heavily. On the other hand, when agents prefer long-run rewards over initial rewards, a more traditional average—the mean payoff objective—is employed. In this case, each element contributes equally to the total amount, and transient properties of the sequence become less significant.

Shapley, in his seminal work [25], showed that stochastic games played on finite arenas with discounted objectives are positionally determined. In other words, there are optimal strategies for both players that need only consider the current state of play. Bewley and Kohlberg related the discounted and mean payoff objectives asymptotically [5] and showed positional determinacy for stochastic games with the mean payoff objective [6].
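To make the two objectives concrete, the following is a minimal sketch (our own illustration; the function names are ours, not the paper's) that approximates Disc(λ) and Mean on a long finite prefix of a reward sequence.

```python
# Approximate Disc(lambda) and Mean on a finite prefix of a reward
# sequence; an illustrative sketch, not code from the paper.

def disc(rewards, lam):
    """Partial discounted sum: sum_{k<=n} lam^k * x_k."""
    return sum(lam ** k * x for k, x in enumerate(rewards))

def mean(rewards):
    """Prefix average: (1/(n+1)) * sum_{k<=n} x_k."""
    return sum(rewards) / len(rewards)

xs = [2, 1] * 5000          # prefix of an alternating sequence 2, 1, 2, 1, ...
print(disc(xs, 0.9))        # ~= (2 + 0.9) / (1 - 0.81) ~= 15.26
print(mean(xs))             # 1.5
```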
Discounting the Past.
Using Shapley's discounting as a conceptual basis, we study past-discounting as a temporally dual notion introduced by Alur et al. [3] for finite sequences. For a finite sequence of weights $(x_0, x_1, \ldots, x_n)$, the past-discounted sum with discount factor $\gamma \in [0, 1)$ is $\sum_{k=0}^{n} \gamma^{n-k} x_k$, which is equivalent to reversing the order of coefficients in the future discounted sum of the same sequence.

To motivate the idea of past-discounting, consider a setting where the weights in a stochastic game arena represent a quantity characterizing the performance of an agent. Under such an interpretation, rewards obtained by the agent at each stage represent some quantifiable immediate response to their actions from the environment. In this sense, the total sum of a finite sequence of rewards encountered by the agent provides a quantification of their reputation over that period of time. The total sum may be thought of as a performance evaluation of the agent done by an evaluator considering each past action with equal significance. In contrast, the past-discounted sum of the finite sequence represents an evaluation of the agent in which actions taken more recently are given greater significance. Here, the discount factor $\gamma$ can be seen as a parameter encoding the evaluator's preference for recent rewards as opposed to those encountered long ago. If we presuppose the cause of this preference to be that the evaluator has imperfect memory that continuously worsens over time, then $\gamma$ naturally represents the rate at which the evaluator forgets or forgives.

If, for instance, the agent is a politician, the reward incurred at each step could correspond to their popularity rating, and the past-discounted sum of these ratings would then be their (weighted) average popularity, placing more significance on recent ratings. In this scenario, the past-discounted evaluation is clearly more relevant than the total sum evaluation, since the favorability of a politician depends on the public's recollection of their past actions, which tends to decay with time. More generally, past-discounting makes sense whenever the agent in question can learn and adapt and the evaluator either has a time-decaying memory or a time-decaying perception of significance.

Contributions.

The main contribution of this work is defining and studying the notion of past-discounting in the more general context of long-run and infinite duration optimization problems. Perhaps the most straightforward way to define the past-discounted objective for an infinite sequence $(x_n)_{n \geq 0}$ of payoffs is to consider the limit $\lim_{n \to \infty} \sum_{k=0}^{n} \gamma^{n-k} x_k$ of the past-discounted sums of finite prefixes of increasing length. However, even for bounded sequences, this sequence may not converge, as shown in Example 1.
▶ Example 1. Consider the bounded infinite sequence $(2, 1, 2, 1, 2, 1, \ldots)$. The past-discounted sum at every even step $k = 2n$ is given as $2 \cdot \frac{1 - (\gamma^2)^{n+1}}{1 - \gamma^2} + \gamma \cdot \frac{1 - (\gamma^2)^{n}}{1 - \gamma^2}$, while, for every odd step $k = 2n + 1$, we have $1 \cdot \frac{1 - (\gamma^2)^{n+1}}{1 - \gamma^2} + 2\gamma \cdot \frac{1 - (\gamma^2)^{n+1}}{1 - \gamma^2}$. Hence, the even subsequence converges to $\frac{2 + \gamma}{1 - \gamma^2}$, while the odd subsequence converges to $\frac{1 + 2\gamma}{1 - \gamma^2}$, and the limit of the sequence as a whole does not exist.

We consider converging limits of these sequences by studying the $\lambda$-discounted sum and the mean payoff of sequences of $\gamma$-past-discounted sums. For an infinite bounded sequence of weights $(x_n)_{n \geq 0}$, we examine the $\gamma$-forgetful reward sequence $\left( \sum_{k=0}^{n} \gamma^{n-k} x_k \right)_{n \geq 0}$, and study the (future) $\lambda$-discounted payoff and mean payoff of the $\gamma$-forgetful reward sequence. We are interested in characterizing the optimal strategy for the agent that, irrespective of the choices made by an adversarial opponent, maximizes the discounted-sum or average-sum of the past-discounted sums of performances. Formally, we define discounted and mean payoff objectives for $\gamma$-forgetful reward sequences as follows:

$$\mathrm{FDisc}(\gamma, \lambda) \stackrel{\text{def}}{=} (x_n)_{n \geq 0} \mapsto \lim_{n \to \infty} \sum_{k=0}^{n} \lambda^k \cdot \sum_{i=0}^{k} \gamma^{k-i} \cdot x_i \qquad \text{(Forgetful Discounted Payoff)}$$

$$\mathrm{FMean}(\gamma) \stackrel{\text{def}}{=} (x_n)_{n \geq 0} \mapsto \liminf_{n \to \infty} \frac{1}{n+1} \sum_{k=0}^{n} \sum_{i=0}^{k} \gamma^{k-i} \cdot x_i \qquad \text{(Forgetful Mean Payoff)}$$

We study past-discounting in the setting of two-player stochastic zero-sum games with finite state and action spaces. In this context, we show that there are positional strategies for both players that optimize the discounted and limit-average sums of forgetful reward sequences. To this end, we formulate the notions of forgetful discounted games and forgetful mean payoff games, respectively, as the appropriate mathematical objects. In both cases of forgetful games, we are able to recover the best known complexity bounds for computing the game values. Finally, we prove a Tauberian theorem relating the values of these games as the discount factor approaches unity, which establishes the existence of an asymptotic value for forgetful games in the sense of [5].

2 Preliminaries

A discrete probability distribution over a (possibly infinite) set $X$ is a function $d : X \to [0, 1]$ such that $\sum_{x \in X} d(x) = 1$ and the support $\mathrm{supp}(d) \stackrel{\text{def}}{=} \{x \in X : d(x) > 0\}$ of $d$ is countable. Let $\mathcal{D}(X)$ denote the set of all discrete distributions over $X$. We say a distribution $d \in \mathcal{D}(X)$ is a point distribution if $d(x) = 1$ for some $x \in X$.

We consider two-player zero-sum games on finite stochastic arenas between two players—named Min and Max—who concurrently choose their actions to move a token along the edges of a graph. The next state is determined by a probabilistic transition function based on the current state and the choices made by the players.

▶ Definition 2. A stochastic game arena (SGA) $\mathcal{G}$ is a tuple $(S, A_{\mathrm{Min}}, A_{\mathrm{Max}}, w, p)$ where: $S$ is a set of states; $A_{\mathrm{Min}}$ and $A_{\mathrm{Max}}$ are sets of actions available to players Min and Max, respectively; $w : S \times A_{\mathrm{Min}} \times A_{\mathrm{Max}} \to \mathbb{R}$ is the weight function; and $p : S \times A_{\mathrm{Min}} \times A_{\mathrm{Max}} \to \mathcal{D}(S)$ is the probabilistic transition function.

We write $A_{\mathrm{Min}}(s) \subseteq A_{\mathrm{Min}}$ and $A_{\mathrm{Max}}(s) \subseteq A_{\mathrm{Max}}$ for the sets of actions available to players Min and Max, respectively, at the state $s \in S$. An SGA is a perfect-information (or turn-based) arena if for all $s \in S$ at least one of the sets $A_{\mathrm{Min}}(s)$ and $A_{\mathrm{Max}}(s)$ is a singleton. For states $s, s' \in S$ and actions $(a, b) \in A_{\mathrm{Min}}(s) \times A_{\mathrm{Max}}(s)$, we write $p(s' \mid s, a, b)$ for $p(s, a, b)(s')$. We call an SGA $\mathcal{G}$ finite if the set of states $S$ and the sets of actions $A_{\mathrm{Min}}$ and $A_{\mathrm{Max}}$ are finite.
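Definition 2 translates directly into a concrete data structure. The sketch below (our own encoding, with hypothetical names, not code from the paper) represents a finite SGA using dictionaries for the weight and transition functions; the example arena deterministically alternates between two states and generates the weight sequence of Example 1.

```python
from dataclasses import dataclass

@dataclass
class SGA:
    """A finite stochastic game arena (S, A_Min, A_Max, w, p);
    an illustrative encoding of Definition 2."""
    states: set
    actions_min: dict  # s -> A_Min(s), the Min actions available at s
    actions_max: dict  # s -> A_Max(s), the Max actions available at s
    weight: dict       # (s, a, b) -> w(s, a, b)
    trans: dict        # (s, a, b) -> {s': p(s' | s, a, b)}

# A two-state arena with a single action per player; its unique play
# produces the alternating weight sequence 2, 1, 2, 1, ...
arena = SGA(
    states={"s0", "s1"},
    actions_min={"s0": {"a"}, "s1": {"a"}},
    actions_max={"s0": {"b"}, "s1": {"b"}},
    weight={("s0", "a", "b"): 2.0, ("s1", "a", "b"): 1.0},
    trans={("s0", "a", "b"): {"s1": 1.0}, ("s1", "a", "b"): {"s0": 1.0}},
)
```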
A play of $\mathcal{G}$ is an infinite sequence $\pi = (s_0, (a_1, b_1), s_1, \ldots) \in S \times ((A_{\mathrm{Min}} \times A_{\mathrm{Max}}) \times S)^{\omega}$ such that $p(s_{i+1} \mid s_i, a_{i+1}, b_{i+1}) > 0$ for all $i \geq 0$. A finite play is a finite such sequence, that is, a sequence in $S \times ((A_{\mathrm{Min}} \times A_{\mathrm{Max}}) \times S)^{*}$. Denote by $\mathrm{last}(\pi)$ the final state in the finite play $\pi$. We write $\mathrm{Plays}_{\mathcal{G}}$ and $\mathrm{fPlays}_{\mathcal{G}}$, respectively, for the set of all plays and finite plays of the SGA $\mathcal{G}$, and $\mathrm{Plays}_{\mathcal{G}}(s)$ and $\mathrm{fPlays}_{\mathcal{G}}(s)$ for the respective subsets of these for which $s$ is the initial state.

A game on an SGA $\mathcal{G}$ starts at an initial state $s \in S$. Players Min and Max produce an infinite play by concurrently choosing state-dependent actions, and then moving to a successor state determined by the transition function. A strategy of player Min in $\mathcal{G}$ is a function $\nu : \mathrm{fPlays} \to \mathcal{D}(A_{\mathrm{Min}})$ such that for all finite plays $\pi \in \mathrm{fPlays}$ we have $\mathrm{supp}(\nu(\pi)) \subseteq A_{\mathrm{Min}}(\mathrm{last}(\pi))$. A strategy $\chi$ of player Max is defined analogously. A strategy $\sigma$ is pure if $\sigma(\pi)$ is a point distribution wherever it is defined; otherwise, $\sigma$ is mixed. We say that a strategy $\sigma$ is positional (or, synonymously, stationary) if $\mathrm{last}(\pi) = \mathrm{last}(\pi')$ implies $\sigma(\pi) = \sigma(\pi')$. Strategies that are not positional are called history dependent. Let $\Sigma_{\mathrm{Min}}$ and $\Sigma_{\mathrm{Max}}$ be the sets of all strategies of player Min and player Max, respectively.

For an SGA $\mathcal{G}$, a state $s$ of $\mathcal{G}$, and a strategy pair $(\nu, \chi) \in \Sigma_{\mathrm{Min}} \times \Sigma_{\mathrm{Max}}$, let $\mathrm{Plays}^{\nu,\chi}(s)$ (resp. $\mathrm{fPlays}^{\nu,\chi}(s)$) denote the set of infinite (resp. finite) plays in which players Min and Max play according to $\nu$ and $\chi$, respectively. Given a finite play $\pi \in \mathrm{fPlays}^{\nu,\chi}(s)$, the basic cylinder set $\mathrm{cyl}(\pi)$ is the set of infinite plays in $\mathrm{Plays}^{\nu,\chi}(s)$ for which $\pi$ is a prefix. Using standard results from probability theory, one can construct a probability space $(\mathrm{Plays}^{\nu,\chi}(s), \mathcal{F}^{\nu,\chi}(s), \Pr^{\nu,\chi}_{s})$, where $\mathcal{F}^{\nu,\chi}(s)$ is the smallest $\sigma$-algebra generated by the basic cylinder sets and $\Pr^{\nu,\chi}_{s} : \mathcal{F}^{\nu,\chi}(s) \to [0, 1]$ is the unique probability measure with $\Pr^{\nu,\chi}_{s}(\mathrm{cyl}(\pi)) = \prod_{i=1}^{k} p(s_i \mid s_{i-1}, a_i, b_i)$ for $\pi = (s_0, (a_1, b_1), s_1, \ldots, s_{k-1}, (a_k, b_k), s_k) \in \mathrm{fPlays}^{\nu,\chi}(s)$. Given a real-valued random variable $f : \mathrm{Plays} \to \mathbb{R}$, the expression $\mathbb{E}^{\nu,\chi}_{s}(f)$ denotes the expected value of $f$ with respect to the probability measure $\Pr^{\nu,\chi}_{s}$.
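In lieu of the analytic construction, expectations under this measure can be explored by Monte Carlo simulation. The sketch below (our own illustration with hypothetical helper names, building on the SGA encoding above) samples finite plays under positional mixed strategies, given as dictionaries mapping each state to a distribution over actions, and estimates a truncated expected discounted payoff.

```python
import random

def sample_play(arena, s, nu, chi, steps):
    """Sample a finite play (s0, (a1, b1), s1, ...) of the given length,
    drawing actions from positional mixed strategies nu and chi."""
    play = [s]
    for _ in range(steps):
        a = random.choices(list(nu[s]), weights=list(nu[s].values()))[0]
        b = random.choices(list(chi[s]), weights=list(chi[s].values()))[0]
        dist = arena.trans[(s, a, b)]
        s = random.choices(list(dist), weights=list(dist.values()))[0]
        play += [(a, b), s]
    return play

def estimate_disc(arena, s, nu, chi, lam, steps=500, trials=1000):
    """Monte Carlo estimate of E_s^{nu,chi}(Disc(lambda)), truncated
    after the given number of steps."""
    total = 0.0
    for _ in range(trials):
        play = sample_play(arena, s, nu, chi, steps)
        states, moves = play[0::2], play[1::2]
        total += sum(lam ** k * arena.weight[(states[k], *moves[k])]
                     for k in range(steps))
    return total / trials
```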
▶ Definition 3. For an SGA $\mathcal{G}$, the following payoffs of player Min to player Max have been considered extensively in the literature (see, for example, [15]).

Discounted Payoff. The $\lambda$-discounted payoff $\mathrm{Disc}(\lambda) : \mathrm{Plays} \to \mathbb{R}$ of a play, for a given discount factor $\lambda \in [0, 1)$, is defined as follows:

$$\mathrm{Disc}(\lambda) \stackrel{\text{def}}{=} (s_0, (a_1, b_1), s_1, \ldots) \mapsto \lim_{n \to \infty} \sum_{k=0}^{n} \lambda^k \cdot w(s_k, a_{k+1}, b_{k+1}).$$

Mean Payoff. The mean payoff $\mathrm{Mean} : \mathrm{Plays} \to \mathbb{R}$ of a play is given by the long-run average of the weight sequence of the play:

$$\mathrm{Mean} \stackrel{\text{def}}{=} (s_0, (a_1, b_1), s_1, \ldots) \mapsto \liminf_{n \to \infty} \frac{1}{n+1} \sum_{k=0}^{n} w(s_k, a_{k+1}, b_{k+1}).$$

We refer to an SGA with a $\lambda$-discounted payoff objective as a stochastic discounted game (SDG) $\mathcal{D}(\lambda) = (\mathcal{G}, \lambda)$ and to an SGA with a mean payoff objective as a stochastic mean payoff game (SMG) $\mathcal{M} = \mathcal{G}$. We later present two novel payoff functions in the context of past-discounted rewards. For this reason, we define the following concepts of determinacy, value, and optimal strategies for stochastic games with an arbitrary payoff function.
Given a payoff function $\mathrm{Payoff} \in \{\mathrm{Disc}(\lambda), \mathrm{Mean}\}$, the objective of player Max in the corresponding game $G \in \{\mathcal{D}(\lambda), \mathcal{M}\}$ is to maximize the payoff, while the objective of player Min is the opposite. For every state $s \in S$, we define its upper value $\overline{\mathrm{Val}}(G, s)$ as the minimum payoff player Min can ensure irrespective of player Max's strategy. Similarly, the lower value $\underline{\mathrm{Val}}(G, s)$ of a state $s \in S$ is the maximum payoff player Max can ensure irrespective of player Min's strategy, i.e.,

$$\overline{\mathrm{Val}}(G, s) = \inf_{\nu \in \Sigma_{\mathrm{Min}}} \sup_{\chi \in \Sigma_{\mathrm{Max}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Payoff}) \quad \text{and} \quad \underline{\mathrm{Val}}(G, s) = \sup_{\chi \in \Sigma_{\mathrm{Max}}} \inf_{\nu \in \Sigma_{\mathrm{Min}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Payoff}).$$

The inequality $\underline{\mathrm{Val}}(G, s) \leq \overline{\mathrm{Val}}(G, s)$ holds for all two-player zero-sum games. A game is determined when, for every state $s \in S$, its lower value and its upper value are equal; we then say that the value $\mathrm{Val}$ of the game exists and $\mathrm{Val}(G, s) = \underline{\mathrm{Val}}(G, s) = \overline{\mathrm{Val}}(G, s)$ for every $s \in S$. For a strategy $\nu \in \Sigma_{\mathrm{Min}}$ of player Min and, similarly, for a strategy $\chi \in \Sigma_{\mathrm{Max}}$ of player Max, we define their values $\mathrm{Val}_{\nu}$ and $\mathrm{Val}_{\chi}$ as

$$\mathrm{Val}_{\nu} : s \mapsto \sup_{\chi \in \Sigma_{\mathrm{Max}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Payoff}) \quad \text{and} \quad \mathrm{Val}_{\chi} : s \mapsto \inf_{\nu \in \Sigma_{\mathrm{Min}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Payoff}).$$

We say that a positional strategy $\nu^*$ of player Min is optimal if $\mathrm{Val}_{\nu^*} = \mathrm{Val}$. Similarly, a positional strategy $\chi^*$ of player Max is optimal if $\mathrm{Val}_{\chi^*} = \mathrm{Val}$. We say that a game is positionally determined if both players have positional optimal strategies.

The seminal work of Shapley [25] established the positional determinacy of stochastic games with discounted payoff functions. This was extended by Bewley and Kohlberg [6] to stochastic games with the mean payoff objective.

▶ Theorem 4 (Positional Determinacy). Stochastic discounted games [25] and stochastic mean payoff games [6] are determined in (mixed) positional strategies. Both of these games are determined in pure and positional strategies for turn-based stochastic game arenas [15].
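For turn-based arenas, positional determinacy underlies a simple value-iteration scheme: the one-step optimality operator is a contraction with modulus λ, so iterating it converges to the value. The sketch below is our own illustration for the turn-based case only (in the concurrent case each step would instead require solving a matrix game); all names are hypothetical.

```python
def value_iteration(states, owner, moves, weight, trans, lam, iters=1000):
    """Iterate the one-step operator of a turn-based discounted game.
    owner[s] is "Max" or "Min"; moves[s] lists the actions of the player
    controlling s; weight[(s, m)] is the stage weight; trans[(s, m)] maps
    successor states to probabilities."""
    v = {s: 0.0 for s in states}
    for _ in range(iters):
        new_v = {}
        for s in states:
            outcomes = [
                weight[(s, m)]
                + lam * sum(p * v[t] for t, p in trans[(s, m)].items())
                for m in moves[s]
            ]
            new_v[s] = max(outcomes) if owner[s] == "Max" else min(outcomes)
        v = new_v  # contraction with modulus lam ensures convergence
    return v
```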
The exact computational complexity of stochastic games with discounted and mean payoff objectives is not known. The following summarizes the best known complexity results.

▶ Theorem 5 (Complexity). Stochastic discounted games are in the class FIXP [14] and are SQRT-SUM-hard. Stochastic mean payoff games are in EXPTIME and are PTIME-hard [11]. For turn-based SGAs, both types of games are in NP ∩ coNP [4].

Stochastic mean payoff games are intimately connected to stochastic discounted games, as evidenced by the subsequent "Tauberian" theorem, due to Bewley and Kohlberg [5], connecting the two payoffs asymptotically.

▶ Theorem 6 (Tauberian Theorem). As $\lambda$ tends to $1$ from below, we have that the equation

$$\mathrm{Val}(\mathcal{M}, s) = \lim_{\lambda \uparrow 1} (1 - \lambda) \cdot \mathrm{Val}(\mathcal{D}(\lambda), s)$$

holds for every state $s \in S$. Moreover, when $\lambda$ is sufficiently close to $1$, strategies optimal for $\mathcal{D}(\lambda)$ are also optimal for $\mathcal{M}$: in particular, for all $\lambda$ above an explicit threshold determined by the number $n$ of states in the SGA and an upper bound $M$ on the numerators and denominators of the (rational) transition probabilities [19, 17, 4].

3 Forgetful Objectives

For an infinite bounded sequence of weights $w = (w_0, w_1, \ldots)$, we consider the $\gamma$-forgetful reward sequence $\left( \sum_{k=0}^{n} \gamma^{n-k} w_k \right)_{n \geq 0}$ rather than its limit, and study the (future) discounted payoff and mean payoff of the $\gamma$-forgetful reward sequence. We are interested in characterizing the optimal strategy for the agent that, irrespective of the choices made by an adversarial environment, maximizes the discounted or average sum of the past-discounted performances.

▶ Definition 7 (Forgetful Objectives). We consider the following forgetful variants of the discounted and mean payoff objectives for reward sequences in an SGA $\mathcal{G}$.
Forgetful Discounted Payoff. The $\lambda$-discounted payoff for the $\gamma$-forgetful reward sequence is the function $\mathrm{FDisc}(\gamma, \lambda) : \mathrm{Plays} \to \mathbb{R}$, defined as follows for discount factors $\lambda, \gamma \in [0, 1)$:

$$\mathrm{FDisc}(\gamma, \lambda) \stackrel{\text{def}}{=} (s_0, (a_1, b_1), s_1, \ldots) \mapsto \lim_{n \to \infty} \sum_{k=0}^{n} \lambda^k \cdot \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right).$$

Forgetful Mean Payoff. The mean payoff for a $\gamma$-forgetful reward sequence is the function $\mathrm{FMean}(\gamma) : \mathrm{Plays} \to \mathbb{R}$ defined as follows, for $\gamma \in [0, 1)$:

$$\mathrm{FMean}(\gamma) \stackrel{\text{def}}{=} (s_0, (a_1, b_1), s_1, \ldots) \mapsto \liminf_{n \to \infty} \frac{1}{n+1} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right).$$

In this paper, we reserve the use of $\gamma \in [0, 1)$ for past-discount factors and $\lambda \in [0, 1)$ for future-discount factors. We refer to an SGA with a forgetful discounted payoff objective, with discount factors $\gamma$ and $\lambda$, as a stochastic forgetful discounted game (FDG) $\mathcal{FD}(\gamma, \lambda) = (\mathcal{G}, \gamma, \lambda)$. Similarly, we call an SGA with a forgetful mean payoff objective a stochastic forgetful mean payoff game (FMG) $\mathcal{FM}(\gamma) = (\mathcal{G}, \gamma)$. The concepts of the upper value, lower value, value, determinacy, and optimal strategies for forgetful discounted and forgetful mean payoff games are defined analogously to the other objectives defined in the previous section. The main result of this work is summarized by the following theorem.

▶ Theorem 8 (Positional Determinacy). Stochastic forgetful discounted games and stochastic forgetful mean payoff games are determined in (mixed) positional strategies. For perfect-information game arenas, pure and positional strategies suffice for optimality.
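Note that the $\gamma$-forgetful reward sequence can be computed online: its $n$-th element $d_n$ obeys the recurrence $d_n = \gamma \cdot d_{n-1} + x_n$ with $d_{-1} = 0$. The sketch below (our own illustration, not code from the paper) uses this recurrence to approximate both forgetful objectives on a long prefix; the printed values agree with the scaled classical payoffs, anticipating the reductions established in the next two sections.

```python
def forgetful(rewards, gamma):
    """Yield the gamma-forgetful sums d_n = sum_{i<=n} gamma^(n-i) * x_i,
    computed online via d_n = gamma * d_{n-1} + x_n."""
    d = 0.0
    for x in rewards:
        d = gamma * d + x
        yield d

def fdisc(rewards, gamma, lam):
    """Partial sum approximating FDisc(gamma, lambda) on a finite prefix."""
    return sum(lam ** k * d for k, d in enumerate(forgetful(rewards, gamma)))

def fmean(rewards, gamma):
    """Prefix average approximating FMean(gamma)."""
    ds = list(forgetful(rewards, gamma))
    return sum(ds) / len(ds)

xs = [2, 1] * 5000            # the alternating sequence from Example 1
print(fdisc(xs, 0.5, 0.9))    # ~= Disc(0.9) of xs divided by 1 - 0.45
print(fmean(xs, 0.5))         # ~= 1.5 / (1 - 0.5) = 3.0
```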
Our proof technique involves reducing forgetful games with discounted and mean payoff objectives to classical discounted and mean payoff games on the same arena with modified weight structures. This reduction enables us to derive the positional determinacy as well as the complexity (Theorem 5) and the Tauberian result (Theorem 6) in the setting of forgetful games. The next two sections are devoted to these reductions.
4 Forgetful Discounted Games

We show that stochastic forgetful discounted games are determined in positional strategies by reducing the problem to computing optimal strategies for a standard discounted game on the same arena. Our primary tool in showing the reduction is the following classical theorem from real analysis (cf. [24], for instance).

▶ Theorem 9 (Mertens' Theorem). Let $a = (a_0, a_1, \ldots)$ and $b = (b_0, b_1, \ldots)$ be two sequences, and let $c = (c_0, c_1, \ldots)$ be the Cauchy product $c_n = \sum_{k=0}^{n} a_k b_{n-k}$ of $a$ and $b$. If $\lim_{n \to \infty} \sum_{k=0}^{n} a_k = A$ and $\lim_{n \to \infty} \sum_{k=0}^{n} b_k = B$, and at least one of the series $\sum_{k \geq 0} a_k$ and $\sum_{k \geq 0} b_k$ converges absolutely, then $\lim_{N \to \infty} \sum_{n=0}^{N} c_n = A \cdot B$.

▶ Theorem 10 (From Forgetful Discounted Games to Discounted Games). Given a stochastic game arena $\mathcal{G} = (S, A_{\mathrm{Min}}, A_{\mathrm{Max}}, w, p)$, a starting state $s \in S$, discount factors $\gamma, \lambda \in [0, 1)$, and a pair of strategies $(\nu, \chi) \in \Sigma_{\mathrm{Min}} \times \Sigma_{\mathrm{Max}}$, it holds that

$$\mathbb{E}^{\nu,\chi}_{s}(\mathrm{FDisc}(\gamma, \lambda)) = \frac{1}{1 - \gamma\lambda} \, \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Disc}(\lambda)).$$
Proof. To prove this theorem, it suffices to show that for every play $\pi = (s_0, (a_1, b_1), s_1, \ldots) \in \mathrm{Plays}_{\mathcal{G}}$ we have

$$\mathrm{FDisc}(\gamma, \lambda)(\pi) = \frac{1}{1 - \gamma\lambda} \, \mathrm{Disc}(\lambda)(\pi).$$

This can be shown by observing the following sequence of equalities:

$$\begin{aligned}
\mathrm{FDisc}(\gamma, \lambda)(\pi) &= \lim_{n \to \infty} \sum_{k=0}^{n} \lambda^k \cdot \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right) \\
&= \lim_{n \to \infty} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} \lambda^k \cdot \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right) \\
&= \lim_{n \to \infty} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} (\gamma\lambda)^{k-i} \cdot \left( \lambda^i \cdot w(s_i, a_{i+1}, b_{i+1}) \right) \right) \\
&= \lim_{n \to \infty} \sum_{k=0}^{n} (\gamma\lambda)^k \cdot \lim_{n \to \infty} \sum_{k=0}^{n} \lambda^k \cdot w(s_k, a_{k+1}, b_{k+1}) \\
&= \frac{1}{1 - \gamma\lambda} \cdot \mathrm{Disc}(\lambda)(\pi).
\end{aligned}$$

The first and the last equalities follow from the definitions and the formula for the geometric series. The second and the third equalities are straightforward. The fourth equality follows from Mertens' theorem and the fact that $\left( \sum_{i=0}^{k} (\gamma\lambda)^{k-i} \cdot (\lambda^i \cdot w(s_i, a_{i+1}, b_{i+1})) \right)_{k \geq 0}$ is the Cauchy product of the sequences $\left( (\gamma\lambda)^k \right)_{k \geq 0}$ and $\left( \lambda^k \, w(s_k, a_{k+1}, b_{k+1}) \right)_{k \geq 0}$. ◀
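A quick numeric sanity check of this identity (our own illustration, on a random bounded weight sequence): the double sum defining FDisc is evaluated directly on a long prefix and compared against Disc(λ)/(1 − γλ).

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(3000)]  # bounded weights
gamma, lam = 0.7, 0.9

# Left-hand side: the double sum defining FDisc(gamma, lambda).
lhs = sum(lam ** k * sum(gamma ** (k - i) * xs[i] for i in range(k + 1))
          for k in range(len(xs)))

# Right-hand side: Disc(lambda) scaled by 1 / (1 - gamma * lambda).
rhs = sum(lam ** k * x for k, x in enumerate(xs)) / (1 - gamma * lam)

print(abs(lhs - rhs))  # negligible, as Theorem 10 predicts
```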
▶ Theorem 11. Stochastic forgetful discounted games are determined in positional strategies.

Proof. The determinacy of stochastic forgetful discounted games is a straightforward consequence of Theorem 10, as

$$\begin{aligned}
\inf_{\nu \in \Sigma_{\mathrm{Min}}} \sup_{\chi \in \Sigma_{\mathrm{Max}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{FDisc}(\gamma, \lambda)) &= \frac{1}{1 - \gamma\lambda} \inf_{\nu \in \Sigma_{\mathrm{Min}}} \sup_{\chi \in \Sigma_{\mathrm{Max}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Disc}(\lambda)) \\
&= \frac{1}{1 - \gamma\lambda} \sup_{\chi \in \Sigma_{\mathrm{Max}}} \inf_{\nu \in \Sigma_{\mathrm{Min}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Disc}(\lambda)) \\
&= \sup_{\chi \in \Sigma_{\mathrm{Max}}} \inf_{\nu \in \Sigma_{\mathrm{Min}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{FDisc}(\gamma, \lambda)).
\end{aligned}$$

The first and the third equalities follow from Theorem 10, and the second equality follows from the determinacy of stochastic discounted games. Following a similar argument, one can show that stochastic forgetful discounted games are positionally determined (due to Theorem 4) and that for perfect-information arenas pure and positional strategies suffice. ◀

5 Forgetful Mean Payoff Games

This section takes a similar approach to Section 4 to show that forgetful mean payoff games are positionally determined. However, the following theorem, analogous to Theorem 10, is proven via alternate techniques, as Mertens' theorem fails to apply.

▶ Theorem 12 (From Forgetful Mean Payoff Games to Mean Payoff Games). Given a stochastic game arena $\mathcal{G} = (S, A_{\mathrm{Min}}, A_{\mathrm{Max}}, w, p)$, a starting state $s \in S$, and a pair of strategies $(\nu, \chi) \in \Sigma_{\mathrm{Min}} \times \Sigma_{\mathrm{Max}}$, it holds that

$$\mathbb{E}^{\nu,\chi}_{s}(\mathrm{FMean}(\gamma)) = \frac{1}{1 - \gamma} \, \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Mean}).$$
Proof. To prove this theorem, it suffices to show that for every play $\pi = (s_0, (a_1, b_1), s_1, \ldots) \in \mathrm{Plays}_{\mathcal{G}}$ we have

$$\mathrm{FMean}(\gamma)(\pi) = \frac{1}{1 - \gamma} \, \mathrm{Mean}(\pi).$$

This can be shown by observing the following sequence of equalities:

$$\begin{aligned}
\mathrm{FMean}(\gamma)(\pi) &= \liminf_{n \to \infty} \frac{1}{n+1} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right) \\
&= \liminf_{n \to \infty} \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1})(1 - \gamma^{n+1-k})}{(n+1)(1 - \gamma)} \\
&= \liminf_{n \to \infty} \left( \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1})}{(n+1)(1 - \gamma)} - \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n+1-k}}{(n+1)(1 - \gamma)} \right) \\
&= \liminf_{n \to \infty} \left( \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1})}{(n+1)(1 - \gamma)} \right) \\
&= \frac{1}{1 - \gamma} \cdot \mathrm{Mean}(\pi).
\end{aligned}$$

The first and the last equalities follow from the definitions of the payoff functions $\mathrm{FMean}$ and $\mathrm{Mean}$. The second equality is shown in Lemma 13. The third equality is straightforward. The fourth equality follows from Lemma 14 and the super-additivity of lim inf: for infinite sequences $(a_n)_{n \geq 0}$ and $(b_n)_{n \geq 0}$, the inequality $\liminf_{n \to \infty}(a_n + b_n) \geq \liminf_{n \to \infty} a_n + \liminf_{n \to \infty} b_n$ holds and, furthermore, if either sequence converges, then $\liminf_{n \to \infty}(a_n + b_n) = \liminf_{n \to \infty} a_n + \liminf_{n \to \infty} b_n$. ◀
▶ Lemma 13. For every play $\pi = (s_0, (a_1, b_1), s_1, \ldots) \in \mathrm{Plays}_{\mathcal{G}}$ and every prefix length $n \geq 0$ we have

$$\frac{1}{n+1} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right) = \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1})(1 - \gamma^{n+1-k})}{(n+1)(1 - \gamma)}.$$

Proof. We proceed by induction on $n$ (cf. Appendix A). ◀
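Since Lemma 13 is an exact identity at every finite n, it can also be checked directly; a sanity-check sketch of our own, on random weights:

```python
import random

random.seed(1)
gamma, n = 0.8, 50
ws = [random.uniform(-5.0, 5.0) for _ in range(n + 1)]

# Left-hand side: prefix average of the gamma-forgetful sums.
lhs = sum(sum(gamma ** (k - i) * ws[i] for i in range(k + 1))
          for k in range(n + 1)) / (n + 1)

# Right-hand side: the closed form from Lemma 13.
rhs = sum(ws[k] * (1 - gamma ** (n + 1 - k)) for k in range(n + 1)) \
      / ((n + 1) * (1 - gamma))

print(abs(lhs - rhs))  # ~1e-15: the identity holds exactly
```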
▶ Lemma 14. For every play $\pi = (s_0, (a_1, b_1), s_1, \ldots) \in \mathrm{Plays}_{\mathcal{G}}$ the following equation holds:

$$\liminf_{n \to \infty} \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n+1-k}}{(n+1)(1 - \gamma)} = 0.$$

Proof. By assumption the sequence $w(\pi) = (w(s_n, a_{n+1}, b_{n+1}))_{n \geq 0}$ is bounded, and so the values $\sup w(\pi)$ and $\inf w(\pi)$ are well-defined as the greatest and least scalar values occurring anywhere in $w(\pi)$. The following derivation shows that $\sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n+1-k}}{(n+1)(1-\gamma)}$ is bounded above by a sequence converging to zero:

$$\begin{aligned}
\lim_{n \to \infty} \sum_{k=0}^{n} \frac{(\sup w(\pi)) \, \gamma^{n+1-k}}{(n+1)(1 - \gamma)} &= \frac{1}{1 - \gamma} \lim_{n \to \infty} \left( \frac{1}{n+1} \sum_{k=0}^{n} (\sup w(\pi)) \, \gamma^{n+1-k} \right) \\
&= \frac{1}{1 - \gamma} \lim_{n \to \infty} \left( \left( \frac{1}{n+1} \right) \left( \frac{(\sup w(\pi)) \, \gamma \, (1 - \gamma^{n+1})}{1 - \gamma} \right) \right) \\
&= \frac{1}{1 - \gamma} \lim_{n \to \infty} \left( \frac{1}{n+1} \right) \cdot \lim_{n \to \infty} \left( \frac{(\sup w(\pi)) \, \gamma \, (1 - \gamma^{n+1})}{1 - \gamma} \right) \\
&= \frac{1}{1 - \gamma} \cdot 0 \cdot \frac{(\sup w(\pi)) \, \gamma}{1 - \gamma} = 0.
\end{aligned}$$

Similarly, we may obtain an identical equation using $\inf w(\pi)$:

$$\lim_{n \to \infty} \sum_{k=0}^{n} \frac{(\inf w(\pi)) \, \gamma^{n+1-k}}{(n+1)(1 - \gamma)} = 0,$$

showing that $\sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n+1-k}}{(n+1)(1-\gamma)}$ is bounded below by a sequence converging to zero as well. Since this sequence is squeezed between two sequences converging to zero, it must converge to zero. ◀
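The vanishing tail of Lemma 14 is also easy to observe numerically (our own illustration): for bounded weights the geometric factors keep the inner sum bounded, so dividing by n + 1 drives the term to zero.

```python
gamma = 0.8
ws = [2.0, 1.0] * 50000  # a bounded weight sequence

for n in (10, 100, 1000, 10000):
    tail = sum(ws[k] * gamma ** (n + 1 - k) for k in range(n + 1)) \
           / ((n + 1) * (1 - gamma))
    print(n, tail)  # shrinks roughly like 1/n, as Lemma 14 requires
```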
▶ Theorem 15. Stochastic forgetful mean-payoff games are determined in positional strategies.

Proof. This proof is similar to that of Theorem 11. The determinacy of stochastic forgetful mean-payoff games is a straightforward consequence of Theorem 12, as

$$\begin{aligned}
\inf_{\nu \in \Sigma_{\mathrm{Min}}} \sup_{\chi \in \Sigma_{\mathrm{Max}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{FMean}(\gamma)) &= \frac{1}{1 - \gamma} \inf_{\nu \in \Sigma_{\mathrm{Min}}} \sup_{\chi \in \Sigma_{\mathrm{Max}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Mean}) \\
&= \frac{1}{1 - \gamma} \sup_{\chi \in \Sigma_{\mathrm{Max}}} \inf_{\nu \in \Sigma_{\mathrm{Min}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{Mean}) \\
&= \sup_{\chi \in \Sigma_{\mathrm{Max}}} \inf_{\nu \in \Sigma_{\mathrm{Min}}} \mathbb{E}^{\nu,\chi}_{s}(\mathrm{FMean}(\gamma)).
\end{aligned}$$

The first and the third equalities follow from Theorem 12, and the second equality follows from the determinacy of stochastic mean payoff games. Following a similar argument, one can show that stochastic forgetful mean-payoff games are positionally determined (due to Theorem 4) and that for perfect-information arenas pure and positional strategies suffice for optimality. ◀

6 Complexity and an Asymptotic Value

As a consequence of Theorems 5, 10, and 12, we have the following complexity results for computing the values of forgetful games.

▶ Theorem 16 (Complexity). Stochastic forgetful discounted games are in the class FIXP and are SQRT-SUM-hard. Stochastic forgetful mean-payoff games are in EXPTIME and are PTIME-hard. For perfect-information SGAs, both stochastic forgetful discounted games and stochastic forgetful mean-payoff games are in NP ∩ coNP.

Additionally, the results from the previous two sections, along with the classical results for discounted and mean-payoff games, allow us to extend the Tauberian theorem of [5], relating asymptotically the values of stochastic forgetful discounted games and stochastic forgetful mean-payoff games.
▶ Theorem 17. For every state $s \in S$, it holds that

$$\mathrm{Val}(\mathcal{FM}(\gamma), s) = \lim_{\lambda \uparrow 1} (1 - \lambda) \cdot \mathrm{Val}(\mathcal{FD}(\gamma, \lambda), s).$$

Proof. By Theorem 10, we obtain the equation

$$\lim_{\lambda \uparrow 1} (1 - \lambda) \, \mathrm{Val}(\mathcal{FD}(\gamma, \lambda), s) = \lim_{\lambda \uparrow 1} \frac{(1 - \lambda) \, \mathrm{Val}(\mathcal{D}(\lambda), s)}{1 - \gamma\lambda}.$$

Applying Theorem 6 to the right-hand side of this equation leads to the following:

$$\lim_{\lambda \uparrow 1} (1 - \lambda) \, \mathrm{Val}(\mathcal{FD}(\gamma, \lambda), s) = \frac{\mathrm{Val}(\mathcal{M}, s)}{1 - \gamma}.$$

Finally, applying Theorem 12 yields the desired equivalence:

$$\lim_{\lambda \uparrow 1} (1 - \lambda) \, \mathrm{Val}(\mathcal{FD}(\gamma, \lambda), s) = \mathrm{Val}(\mathcal{FM}(\gamma), s).$$

The proof is now complete. ◀

7 Related Work and Conclusion

The discounted and mean-payoff objectives have played central roles in the theory of stochastic games. Consequently, a multitude of deep results exist connecting these objectives [5, 6, 21, 4, 8, 10, 28, 27, 29], in addition to an extensive body of work related to algorithms for and the computational complexity of solving these games [16, 22, 23, 11, 14, 9].

The past-discounted sum of finite sequences was studied in the context of optimization by Alur et al. [3]. The past-discounted payoff is also closely related to the exponential recency-weighted average technique used in nonstationary multi-armed bandit problems [26] to estimate the average reward of different actions by giving more weight to recent outcomes. However, to the best of our knowledge, past-discounting has not yet been formally studied as a payoff function in stochastic games. The main contribution of this work is defining and studying the notion of past-discounting in the more general context of long-run and infinite duration optimization problems and games. Our results characterize an elegant connection between forgetful discounted and mean-payoff games and their classical counterparts. Characterizing similar connections between the classical and the past versions of other popular objectives—such as lim inf and lim sup—remains an interesting open problem.

Discounted objectives have found significant applications in areas of program verification and synthesis [13, 7]. Relatively recently—although the idea of past operators is quite old [18]—a number of classical formalisms including temporal logics such as LTL and CTL and the modal µ-calculus have been extended with past-tense operators and with discounted quantitative semantics [12, 1, 2]. A particularly significant result [20] around LTL with classical boolean semantics is that, while LTL with past operators is no more expressive than standard LTL, it is exponentially more succinct. It remains open whether this type of relationship between logics and their extensions by past operators holds when they are interpreted with discounted quantitative semantics [2].

Regret minimization is a popular criterion in the setting of online learning, where a decision-maker chooses her actions so as to minimize the average regret—the difference between the realized reward and the reward that could have been achieved. We argue that imperfect decision makers may view their regret in a past-discounted sense—a suboptimal action in the recent past tends to cause more regret than an equally suboptimal action in the remote past. We hope that the results of this work spur further interest in developing foundations of past-discounted characterizations of regret in online learning and optimization.
References

[1] Shaull Almagor, Udi Boker, and Orna Kupferman. Discounting in LTL. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), volume 8413 of LNCS, pages 424–439. Springer, 2014.
[2] Shaull Almagor, Udi Boker, and Orna Kupferman. Formally reasoning about quality. J. ACM, 63(3):24:1–24:56, 2016.
[3] Rajeev Alur, Loris D'Antoni, Jyotirmoy V. Deshmukh, Mukund Raghothaman, and Yifei Yuan. Regular functions, cost register automata, and generalized min-cost problems, 2012.
[4] Daniel Andersson and Peter Bro Miltersen. The complexity of solving stochastic games on graphs. In Algorithms and Computation (ISAAC), volume 5878 of LNCS, pages 112–121. Springer, 2009.
[5] Truman Bewley and Elon Kohlberg. The asymptotic theory of stochastic games. Mathematics of Operations Research, 1(3):197–208, 1976.
[6] Truman Bewley and Elon Kohlberg. On stochastic games with stationary optimal strategies. Mathematics of Operations Research, 3(2):104–125, 1978.
[7] Pavol Cerný, Krishnendu Chatterjee, Thomas A. Henzinger, Arjun Radhakrishna, and Rohit Singh. Quantitative synthesis for concurrent programs. In Computer Aided Verification (CAV), volume 6806 of LNCS, pages 243–259. Springer, 2011.
[8] Krishnendu Chatterjee, Laurent Doyen, and Rohit Singh. On memoryless quantitative objectives. In Fundamentals of Computation Theory (FCT), volume 6914 of LNCS, pages 148–159. Springer, 2011.
[9] Krishnendu Chatterjee and Rasmus Ibsen-Jensen. Qualitative analysis of concurrent mean-payoff games. Inf. Comput., 242:2–24, 2015.
[10] Krishnendu Chatterjee and Rupak Majumdar. Discounting and averaging in games across time scales. Int. J. Found. Comput. Sci., 23(3):609–625, 2012.
[11] Krishnendu Chatterjee, Rupak Majumdar, and Thomas A. Henzinger. Stochastic limit-average games are in EXPTIME. Int. J. Game Theory, 37(2):219–234, 2008.
[12] Luca de Alfaro, Marco Faella, Thomas A. Henzinger, Rupak Majumdar, and Mariëlle Stoelinga. Model checking discounted temporal properties. Theor. Comput. Sci., 345(1):139–170, 2005.
[13] Luca de Alfaro, Thomas A. Henzinger, and Rupak Majumdar. Discounting the future in systems theory. In Automata, Languages and Programming (ICALP), volume 2719 of LNCS, pages 1022–1037. Springer, 2003.
[14] Kousha Etessami and Mihalis Yannakakis. On the complexity of Nash equilibria and other fixed points. SIAM J. Comput., 39(6):2531–2597, 2010.
[15] Jerzy Filar and Koos Vrieze. Competitive Markov Decision Processes. Springer-Verlag, Berlin, Heidelberg, 1996.
[16] Jerzy A. Filar and Todd A. Schultz. Nonlinear programming and stationary strategies in stochastic games. Math. Program., 34(2):243–247, 1986.
[17] A. Hordijk and A. A. Yushkevich. Blackwell optimality. In Handbook of Markov Decision Processes: Methods and Applications, pages 231–267. Springer, 2002.
[18] Orna Lichtenstein, Amir Pnueli, and Lenore D. Zuck. The glory of the past. In Logics of Programs, volume 193 of LNCS, pages 196–218. Springer, 1985.
[19] T. M. Liggett and S. A. Lippman. Short notes: Stochastic games with perfect information and time average payoff. SIAM Review, 11(4):604–607, 1969.
[20] Nicolas Markey. Temporal logic with past is exponentially more succinct, concurrency column. Bull. EATCS, 79:122–128, 2003.
[21] J. F. Mertens and Abraham Neyman. Stochastic games. International Journal of Game Theory, 10(2):53–66, 1981.
[22] T. E. S. Raghavan and Jerzy A. Filar. Algorithms for stochastic games – a survey. ZOR Methods Model. Oper. Res., 35(6):437–472, 1991.
[23] T. E. S. Raghavan and Zamir Syed. A policy-improvement type algorithm for solving zero-sum two-person stochastic games of perfect information. Math. Program., 95(3):513–532, 2003.
[24] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, New York, 1964.
[25] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
[26] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[27] Bruno Ziliotto. General limit value in zero-sum stochastic games. Int. J. Game Theory, 45(1-2):353–374, 2016.
[28] Bruno Ziliotto. A Tauberian theorem for nonexpansive operators and applications to zero-sum stochastic games. Mathematics of Operations Research, 41(4):1522–1534, 2016.
[29] Bruno Ziliotto. Tauberian theorems for general iterations of operators: Applications to zero-sum stochastic games. Games Econ. Behav., 108:486–503, 2018.
A Proof of Lemma 13

Base case: Lemma 13 holds for $n = 0$. Let $n = 0$; then

$$\frac{1}{n+1} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right) = w(s_0, a_1, b_1) = \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1})(1 - \gamma^{n+1-k})}{(n+1)(1 - \gamma)}.$$

Inductive case: If Lemma 13 holds for $n - 1$, then it also holds for $n$. Firstly, observe the following derivation:

$$\begin{aligned}
\frac{1}{n+1} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right) &= \sum_{k=0}^{n} \frac{1}{n+1} \sum_{i=0}^{k} \gamma^{k-i} \, w(s_i, a_{i+1}, b_{i+1}) \\
&= \left( \sum_{k=0}^{n-1} \frac{1}{n+1} \sum_{i=0}^{k} \gamma^{k-i} \, w(s_i, a_{i+1}, b_{i+1}) \right) + \frac{1}{n+1} \sum_{k=0}^{n} \gamma^{n-k} \, w(s_k, a_{k+1}, b_{k+1}) \\
&= \frac{n}{n+1} \left( \sum_{k=0}^{n-1} \frac{1}{n} \sum_{i=0}^{k} \gamma^{k-i} \, w(s_i, a_{i+1}, b_{i+1}) \right) + \frac{1}{n+1} \sum_{k=0}^{n} \gamma^{n-k} \, w(s_k, a_{k+1}, b_{k+1}).
\end{aligned}$$

Notice that the expression within the parentheses matches exactly the left-hand side of Lemma 13 for $n - 1$. By the inductive hypothesis, we may rewrite this as follows:

$$\frac{1}{n+1} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right) = \frac{n}{n+1} \left( \sum_{k=0}^{n-1} \frac{w(s_k, a_{k+1}, b_{k+1})(1 - \gamma^{n-k})}{n(1 - \gamma)} \right) + \frac{1}{n+1} \left( \sum_{k=0}^{n} w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n-k} \right).$$

On the right-hand side, the $n$ terms in the left summand cancel, and factoring $(1 - \gamma)$ out of the denominator leaves

$$\frac{1}{1 - \gamma} \left( \sum_{k=0}^{n-1} \frac{w(s_k, a_{k+1}, b_{k+1})(1 - \gamma^{n-k})}{n+1} \right) + \left( \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n-k}}{n+1} \right).$$

Now, factoring again by $\frac{1}{1 - \gamma}$ and by $\frac{1}{n+1}$ yields

$$\frac{\left( \sum_{k=0}^{n-1} w(s_k, a_{k+1}, b_{k+1})(1 - \gamma^{n-k}) \right) + \left( \sum_{k=0}^{n} w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n-k} (1 - \gamma) \right)}{(n+1)(1 - \gamma)}.$$

Multiplying out the terms within the summations in the numerator, we get the following expression:

$$\frac{\left( \sum_{k=0}^{n-1} w(s_k, a_{k+1}, b_{k+1}) - w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n-k} \right) + \left( \sum_{k=0}^{n} w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n-k} - w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n+1-k} \right)}{(n+1)(1 - \gamma)}.$$

Taking advantage of additive cancellation in the numerator, we obtain

$$\frac{\left( \sum_{k=0}^{n-1} w(s_k, a_{k+1}, b_{k+1}) \right) + w(s_n, a_{n+1}, b_{n+1}) - \left( \sum_{k=0}^{n} w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n+1-k} \right)}{(n+1)(1 - \gamma)}.$$

With some simple regrouping of terms, the above may be rewritten as

$$\frac{\left( \sum_{k=0}^{n} w(s_k, a_{k+1}, b_{k+1}) \right) - \left( \sum_{k=0}^{n} w(s_k, a_{k+1}, b_{k+1}) \, \gamma^{n+1-k} \right)}{(n+1)(1 - \gamma)},$$

and subsequently

$$\frac{\sum_{k=0}^{n} w(s_k, a_{k+1}, b_{k+1})(1 - \gamma^{n+1-k})}{(n+1)(1 - \gamma)}.$$

Finally, by the linearity of summation, we obtain the desired equality:

$$\frac{1}{n+1} \sum_{k=0}^{n} \left( \sum_{i=0}^{k} \gamma^{k-i} \cdot w(s_i, a_{i+1}, b_{i+1}) \right) = \sum_{k=0}^{n} \frac{w(s_k, a_{k+1}, b_{k+1})(1 - \gamma^{n+1-k})}{(n+1)(1 - \gamma)}. \qquad ◀$$