Arena-Independent Finite-Memory Determinacy in Stochastic Games
Patricia Bouyer, Youssouf Oualhadj, Mickael Randour, Pierre Vandenhove
LMF, Université Paris-Saclay, CNRS, ENS Paris-Saclay, France
LACL – Université Paris-Est Créteil, Créteil, France
F.R.S.-FNRS & UMONS – Université de Mons, Belgium
Abstract.
We study stochastic zero-sum games on graphs, which are prevalent tools to model decision-making in presence of an antagonistic opponent in a random environment. In this setting, an important question is the one of strategy complexity: what kinds of strategies are sufficient or required to play optimally (e.g., randomization or memory requirements)? Our contributions further the understanding of arena-independent finite-memory (AIFM) determinacy, i.e., the study of objectives for which memory is needed, but in a way that only depends on limited parameters of the game graphs. First, we show that objectives for which pure AIFM strategies suffice to play optimally also admit pure AIFM subgame perfect strategies. Second, we show that we can reduce the study of objectives for which pure AIFM strategies suffice in two-player stochastic games to the easier study of one-player stochastic games (i.e., Markov decision processes). Third, we characterize the sufficiency of AIFM strategies through two intuitive properties of objectives. This work extends a line of research started on deterministic games in [BLO+20] to stochastic ones.
Keywords: two-player games on graphs, stochastic games, Markov decision processes, finite-memory determinacy, optimal strategies
1 Introduction

Controller synthesis consists, given a system, an environment, and a specification, in automatically generating a controller of the system that guarantees the specification in the environment. This task is often studied through a game-theoretic lens: the system is a game, the controller is a player, the uncontrollable environment is its adversary, and the specification is a game objective [Ran13]. A game on a graph consists of a directed graph, called an arena, partitioned into two kinds of vertices: some of them are controlled by the system (called player 1) and the others by the environment (called player 2). Player 1 is given a game objective (corresponding to the specification) and must devise a strategy (corresponding to the controller) to accomplish the objective or optimize an outcome. The strategy can be seen as a function that dictates the decisions to make in order to react to every possible chain of events. In case of uncertainty in the system or the environment, probability distributions are often used to model transitions in the game graph, giving rise to the stochastic game model. We study here stochastic turn-based zero-sum games on graphs [Con92], also called perfect-information stochastic games. We also discuss the case of deterministic games, which can be seen as a subcase of stochastic games in which no probability distributions are used in transitions.
Strategy complexity.
A common question underlying all game objectives is the one of strategy complexity: how complex must optimal strategies be, and how simple can optimal strategies be? For each distinct game objective, multiple directions can be investigated, such as the need for randomization [CDGH10] (must optimal strategies make stochastic choices?), the need for memory [GZ05,GZ09,BLO+20] (how much information about the past must optimal strategies remember?), or what trade-offs exist between randomization and memory [CdAH04,CRR14,MPR20]. With respect to memory requirements, three cases are typically distinguished: memoryless-determined objectives, for which memoryless strategies suffice to play optimally; finite-memory-determined objectives, for which finite-memory strategies suffice (memory is then usually encoded as a deterministic finite automaton); and objectives for which infinite memory is required. High memory requirements (such as exponential memory and obviously infinite memory) are a major drawback when it comes to implementing controllers; hence specific approaches are often developed to look for simple strategies (e.g., [DKQR20]).

Many classical game objectives (reachability [Con92], Büchi and parity [CJH04], energy [BBE10], discounted sum [Sha53]...) are memoryless-determined, both in deterministic and stochastic arenas. Nowadays, multiple general results allow for a more manageable proof for most of these objectives: we mention [Kop06,BFMM11,AR17] for sufficient conditions in deterministic games, and [Gim07,GK14] for similar conditions in one-player and two-player stochastic games. One milestone for memoryless determinacy in deterministic games was achieved by Gimbert and Zielonka [GZ05], who provide two characterizations of it: the first one states two necessary and sufficient conditions (called monotony and selectivity) for memoryless determinacy, and the second one states that memoryless determinacy in both players' one-player games suffices for memoryless determinacy in two-player games (we call this result the one-to-two-player lift).

⋆ Research supported by F.R.S.-FNRS under Grant n° F.4520.18 (ManySynth), and ENS Paris-Saclay visiting professorship (M. Randour, 2019). Mickael Randour is an F.R.S.-FNRS Research Associate and Pierre Vandenhove is an F.R.S.-FNRS Research Fellow.
Together, these characterizations provide a theoretical and practical advance. On the one hand, monotony and selectivity improve the high-level understanding of what conditions well-behaved objectives verify. On the other hand, only having to consider the one-player case thanks to the one-to-two-player lift is of tremendous help in practice. A generalization of the one-to-two-player lift to stochastic games was shown also by Gimbert and Zielonka in an unpublished paper [GZ09] and is about memoryless strategies that are pure (i.e., not using randomization).
The need for memory.
Recent research tends to study increasingly complex settings — such as combinations of qualitative/quantitative objectives or of behavioral models — for which finite or infinite memory is often required; see examples in deterministic games [CD12,VCD+...] and in stochastic games [...], motivating the study of finite-memory determinacy. Proving finite-memory determinacy is sometimes difficult (already in deterministic games, e.g., [BHM+...]).

Arena-independent finite-memory.
An interesting middle ground between the well-understood memoryless determinacy and the more puzzling finite-memory determinacy was proposed for deterministic games in [BLO+20] with the formalization of arena-independent finite-memory (AIFM) strategies: strategies that may use memory, but given an objective, for which a single memory structure suffices to play optimally in any arena. In practice, this memory structure may depend on parameters of the objective (for instance, largest weight, number of priorities), but not on parameters intrinsically linked to the arena (e.g., number of states or transitions). In stochastic games, objectives for which AIFM strategies notably suffice include the aforementioned memoryless-determined objectives (a very simple memory structure with a single state suffices), generalized reachability games [CKWW20] (the memory depends on the number of objectives, but not on the size of the arena), and ω-regular games [CH12] (the memory depends on the objective, e.g., on the number of priorities). Still, there are objectives for which finite-memory strategies suffice, but with an underlying memory structure depending on parameters of the arena (examples are provided in [MSTW21, Theorem 6]).

Nevertheless, AIFM strategies have a remarkable feature: in deterministic arenas, AIFM generalizations of both characterizations from [GZ05] hold, including the one-to-two-player lift [BLO+20].

Contributions. We provide an overview of desirable properties of objectives for which pure AIFM strategies suffice to play optimally in stochastic games, and tools to study them.
This entails:
– a proof of a specific feature of objectives for which pure AIFM strategies suffice to play optimally: for such objectives, there also exist pure AIFM subgame perfect strategies (Theorem 22), which is a stronger requirement than optimality;
– a more general one-to-two-player lift: we show the equivalence between the existence of pure AIFM optimal strategies in two-player games for both players and the existence of pure AIFM strategies in one-player games, thereby simplifying the proof of memory requirements for many objectives (Theorem 23);
– two intuitive conditions generalizing monotony and selectivity in the stochastic/AIFM case, which are equivalent to the sufficiency of pure AIFM strategies to play optimally in one-player stochastic arenas (Theorem 32) for all objectives that can be encoded as real payoff functions.
In practice, this last theorem can be used to prove memory requirements in one-player arenas, and then the second theorem can be used to lift these to the two-player case.

These results reinforce both sides of the frontier between AIFM strategies and general finite-memory strategies: on the one hand, objectives for which pure AIFM strategies suffice indeed share interesting properties with objectives for which pure memoryless strategies suffice, rendering their analysis easier, even in the stochastic case; on the other hand, our novel result about subgame perfect (SP) strategies does not hold for (arena-dependent) finite-memory strategies, and therefore further distinguishes the AIFM case from the finite-memory case.

The one-to-two-player lift for pure AIFM strategies in stochastic games is not surprising, as it holds for pure memoryless strategies in stochastic games [GZ09], and for AIFM strategies in deterministic games [BLO+...]. Its proof in the stochastic setting nonetheless raises technical difficulties; the concepts of [BLO+20] paved the way to neatly overcome this technical hindrance, and we were able to factorize the main argument in Lemma 21.
Outline.
We introduce our framework and notations in Section 2. We discuss AIFM strategies and tools to relate them to memoryless strategies in Section 3, which allows us to prove our result about subgame perfect strategies. The one-to-two-player lift is presented in Section 4, followed by the one-player characterization in Section 5. We provide an illustrative application of our results in Section 6.
2 Preliminaries

Let C be an arbitrary set of colors.

Probabilities.
For a measurable space (Ω, F) (resp. a finite set Ω), we write Dist(Ω, F) (resp. Dist(Ω)) for the set of probability distributions on (Ω, F) (resp. on Ω). For Ω a finite set and µ ∈ Dist(Ω), we write Supp(µ) = {ω ∈ Ω | µ(ω) > 0} for the support of µ.

Arenas. We consider stochastic games played by two players, called P₁ (for player 1) and P₂ (for player 2), who play in a turn-based fashion on arenas.

Definition 1 (Arena).
A (two-player stochastic turn-based) arena is a tuple A = (S₁, S₂, A, δ, col), where:
– S₁ and S₂ are two disjoint finite sets of states, respectively controlled by P₁ and P₂ — we denote S = S₁ ⊎ S₂ for the union of all states;
– A is a finite set of actions;
– δ : S × A → Dist(S) is a partial function called probabilistic transition function;
– col : S × A → C is a partial function called coloring function.
For a state s ∈ S, we write A(s) for the set of actions that are available in s, that is, the set of actions for which δ(s, a) is defined. For s ∈ S, function col must be defined for all pairs (s, a) such that a is available in s. We require that for all s ∈ S, A(s) ≠ ∅.

The last condition ensures that there is at least one available action in every state (i.e., arenas are non-blocking). For s, s′ ∈ S and a ∈ A(s), we usually denote δ(s, a, s′) instead of δ(s, a)(s′) for the probability to reach s′ in one step by playing a in s, and we write (s, a, s′) ∈ δ if and only if δ(s, a, s′) >
0. An interesting subclass of (stochastic) arenas is the class of deterministic arenas: an arena A = (S₁, S₂, A, δ, col) is deterministic if for all s ∈ S, a ∈ A(s), |Supp(δ(s, a))| = 1.

Let A = (S₁, S₂, A, δ, col) be an arena. A play of A is an infinite sequence of states and actions s₀a₁s₁a₂s₂... ∈ (SA)^ω such that for all i ≥
0, (sᵢ, aᵢ₊₁, sᵢ₊₁) ∈ δ. The set of all plays starting in a state s ∈ S is denoted Plays(A, s). A prefix of a play is an element in S(AS)* and is called a history; the set of all histories starting in a state s ∈ S is denoted Hists(A, s). For S′ ⊆ S, we write Plays(A, S′) (resp. Hists(A, S′)) for the unions of Plays(A, s) (resp. Hists(A, s)) over all states s ∈ S′. For ρ = s₀a₁s₁...aₙsₙ a history, we write out(ρ) for sₙ. For i ∈ {1, 2}, we write Histsᵢ(A, s) and Histsᵢ(A, S′) for the corresponding histories ρ such that out(ρ) ∈ Sᵢ. For s ∈ S (resp. S′ ⊆ S) and s′ ∈ S, we write Hists(A, s, s′) (resp. Hists(A, S′, s′)) for the histories ρ in Hists(A, s) (resp. Hists(A, S′)) such that out(ρ) = s′.

We write ĉol for the extension of col to histories and plays: more precisely, for a history ρ = s₀a₁s₁...aₙsₙ, we write ĉol(ρ) for the finite sequence col(s₀, a₁)...col(sₙ₋₁, aₙ) ∈ C*; for π = s₀a₁s₁a₂s₂... a play, we write ĉol(π) for the infinite sequence col(s₀, a₁)col(s₁, a₂)... ∈ C^ω.

A one-player arena of Pᵢ is an arena A = (S₁, S₂, A, δ, col) such that for all s ∈ S₃₋ᵢ, |A(s)| = 1. A one-player arena in our context corresponds to the notion of Markov decision process (MDP) often found in the literature [Put94,BK08].

For technical reasons that will be further justified later, we will usually work on arenas where the set of initial states is explicitly specified.
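To make these definitions concrete, here is a small Python sketch (all identifiers are ours, not from the paper) of an arena in which both players pick actions uniformly at random; it checks the non-blocking condition and samples a history together with its color sequence ĉol(ρ).

```python
import random

# A sketch of Definition 1: states are partitioned between the two players,
# delta maps (state, action) to a distribution over successors, and
# col maps (state, action) to a color. All names here are illustrative.
S1 = {"s0"}                      # states of P1
S2 = {"s1"}                      # states of P2
delta = {                        # partial probabilistic transition function
    ("s0", "a"): {"s0": 0.5, "s1": 0.5},
    ("s0", "b"): {"s1": 1.0},
    ("s1", "a"): {"s0": 1.0},
}
col = {("s0", "a"): 1, ("s0", "b"): 2, ("s1", "a"): 0}  # coloring function

def available(s):
    """A(s): the actions for which delta(s, .) is defined."""
    return [a for (t, a) in delta if t == s]

# Non-blocking check: every state has at least one available action.
assert all(available(s) for s in S1 | S2)

def sample_history(s, n, rng=random.Random(0)):
    """Sample a history s0 a1 s1 ... an sn of length n, choosing actions
    uniformly at random (a trivial randomized strategy for both players)."""
    hist = [s]
    for _ in range(n):
        a = rng.choice(available(s))
        succs, probs = zip(*delta[(s, a)].items())
        s = rng.choices(succs, weights=probs)[0]
        hist += [a, s]
    return hist

hist = sample_history("s0", 4)
colors = [col[(hist[i], hist[i + 1])] for i in range(0, 8, 2)]  # col^(rho)
```

A history of length n alternates n + 1 states with n actions, and its color sequence has exactly n colors, one per (state, action) pair along the way.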
Definition 2 (Initialized arena). An initialized arena is a couple (A, S_init) such that A is an arena and S_init is a non-empty subset of the states of A, called the set of initial states. We assume w.l.o.g. that all states of A are reachable from S_init following transitions with positive probabilities in the probabilistic transition function of A.

If an initialized arena has only one initial state s ∈ S, we write (A, s) for (A, {s}). We will often compare initialized arenas even if they are not formally defined on the same state space by using a natural definition of isomorphism: we say that two initialized arenas ((S₁, S₂, A, δ, col), S_init) and ((S′₁, S′₂, A′, δ′, col′), S′_init) are isomorphic if there exist a bijection ψ_S : S → S′ and, for all s ∈ S, a bijection ψ_A^s : A(s) → A′(ψ_S(s)) such that ψ_S(S₁) = S′₁, ψ_S(S₂) = S′₂, ψ_S(S_init) = S′_init, and for all s₁, s₂ ∈ S, a ∈ A, we have δ(s₁, a)(s₂) = δ′(ψ_S(s₁), ψ_A^s₁(a))(ψ_S(s₂)) and col(s₁, a) = col′(ψ_S(s₁), ψ_A^s₁(a)).

We will consider sets (which we call classes) of initialized arenas, which are usually denoted by the letter A. Although our results often apply to more fine-grained classes of arenas, typical classes that we will consider consist of all one-player or two-player, deterministic or stochastic initialized arenas. We use initialized arenas throughout the paper for technical reasons, but all of our results can be converted to results using only the more classical notion of arena.

Memory. To play in games, players use strategies, which can sometimes be efficiently implemented with finite memory. We define a classical notion of memory based on complete deterministic automata on colors. The goal of using colors instead of states/actions for transitions of the memory is to allow the definition of memory structures independently of arenas, so that they can be used in all arenas.
Definition 3 (Memory skeleton). A memory skeleton is a tuple M = (M, m_init, α_upd) where M is a set of memory states, m_init ∈ M is an initial state and α_upd : M × C → M is an update function. We add the following constraint: for all finite sets of colors B ⊆ C, the number of states reachable from m_init with transitions provided by α_upd|_{M×B} is finite (where α_upd|_{M×B} is the restriction of the domain of α_upd to M × B).

We slightly relax the usual finiteness constraint for the state space by simply requiring that whenever restricted to finitely many colors, the state space of the skeleton is finite. Memory skeletons with a finite state space are all encompassed by this definition, but this also allows some memory skeletons with infinitely many states. For example, if C = ℕ, the tuple (ℕ, 0, (m, n) ↦ max{m, n}), which remembers the greatest color seen, is a valid memory skeleton: for any finite B ⊆ C, we only need to use memory states up to max B. However, the tuple (ℕ, 0, (m, n) ↦ m + n) remembering the current sum of all colors seen is not a memory skeleton, as infinitely many states are reachable from 0, even for a singleton set B containing a positive color.

We denote α̂_upd : M × C* → M for the natural extension of α_upd to finite sequences of colors. It will often be useful to use two memory skeletons in parallel, which is equivalent to using their product.

Definition 4 (Product of skeletons).
Let M₁ = (M₁, m_init^1, α_upd^1) and M₂ = (M₂, m_init^2, α_upd^2) be two memory skeletons. We define their product M₁ ⊗ M₂ as the memory skeleton (M, m_init, α_upd) obtained as follows: M = M₁ × M₂, m_init = (m_init^1, m_init^2), and, for all m₁ ∈ M₁, m₂ ∈ M₂, c ∈ C, α_upd((m₁, m₂), c) = (α_upd^1(m₁, c), α_upd^2(m₂, c)).

The update function of the product simply updates both skeletons in parallel.
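Definitions 3 and 4 can be sketched in code. Below is a minimal Python rendering (class and function names are ours, not from the paper) of memory skeletons with lazily explored state spaces, the max-skeleton from the example above, and the product construction.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Skeleton:
    """A memory skeleton M = (M, m_init, alpha_upd); the state space is
    left implicit and explored lazily through the update function."""
    m_init: Any
    upd: Callable[[Any, Any], Any]   # alpha_upd : M x C -> M

    def run(self, word):
        """alpha_upd extended to finite color sequences (hat alpha_upd)."""
        m = self.m_init
        for c in word:
            m = self.upd(m, c)
        return m

# The max-skeleton over C = N: remembers the greatest color seen so far.
M_max = Skeleton(0, lambda m, c: max(m, c))

# A second illustrative skeleton: remembers the parity of the last color.
M_par = Skeleton(0, lambda m, c: c % 2)

def product(sk1, sk2):
    """M1 (x) M2: update both skeletons in parallel (Definition 4)."""
    return Skeleton((sk1.m_init, sk2.m_init),
                    lambda m, c: (sk1.upd(m[0], c), sk2.upd(m[1], c)))

M = product(M_max, M_par)
# After reading colors 3 1 4 2: greatest color is 4, last color 2 is even.
assert M.run([3, 1, 4, 2]) == (4, 0)
```

Only the states actually reached while reading a word are materialized, which mirrors the relaxed finiteness constraint: for a finite color set B, only finitely many states are ever produced.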
Definition 5 (Product initialized arenas).
Let (A = (S₁, S₂, A, δ, col), S_init) be an initialized arena and M = (M, m_init, α_upd) be a memory skeleton. We define the product initialized arena (A, S_init) ⋉ M as the initialized arena ((S′₁, S′₂, A, δ′, col′), S′_init) where:
– S′_init = S_init × {m_init},
– δ′ : (S × M) × A → Dist(S × M) is such that for all (s, m) ∈ S × M and a ∈ A, δ′((s, m), a) is defined if and only if δ(s, a) is defined, in which case δ′((s, m), a, (s′, m′)) is equal to δ(s, a, s′) if α_upd(m, col(s, a)) = m′, and is 0 otherwise — this implies that A((s, m)) = A(s),
– S′ is the smallest subset of S × M such that S′_init ⊆ S′, and for all (s, m), (s′, m′) ∈ S × M, a ∈ A, if (s, m) ∈ S′ and δ′((s, m), a, (s′, m′)) is positive, then (s′, m′) ∈ S′; we define S′₁ = S′ ∩ (S₁ × M) and S′₂ = S′ ∩ (S₂ × M),
– for all (s, m) ∈ S′ and a ∈ A(s), col′((s, m), a) = col(s, a).

A product initialized arena (A, S_init) ⋉ M is an initialized arena with transitions obtained from A, with state space enriched with extra information about the current memory state, which is initialized at m_init. We only keep states that are reachable from S′_init following transitions of δ′, thereby enforcing that in initialized arenas, all states in the state space are reachable from the initial states. Even if memory skeletons have infinitely many states or transitions, product initialized arenas are always finite, as only finitely many colors appear in an initialized arena, and only these colors appear in the product initialized arena.

Strategies.
We can now define strategies, which are functions describing what each player does in response to every possible scenario.
Definition 6 (Strategy).
Given an initialized arena (A, S_init) and i ∈ {1, 2}, a strategy of Pᵢ on (A, S_init) is a function σᵢ : Histsᵢ(A, S_init) → Dist(A) such that for all ρ ∈ Histsᵢ(A, S_init), Supp(σᵢ(ρ)) ⊆ A(out(ρ)). For i ∈ {1, 2}, we denote by Σ^G_i(A, S_init) the set of strategies of Pᵢ on (A, S_init).

The G in Σ^G_i(A, S_init) stands for "general" as it encompasses all possible strategies; we now discuss interesting subclasses of strategies.

A strategy σᵢ of Pᵢ on (A, S_init) is pure if it does not resort to probability distributions to choose actions, that is, if for all ρ ∈ Histsᵢ(A, S_init), |Supp(σᵢ(ρ))| = 1. If a strategy is not pure, then it is randomized.

A strategy σᵢ of Pᵢ on (A, S_init) is memoryless if every distribution over actions it selects only depends on the current state of the arena, and not on the whole history, that is, if for all ρ, ρ′ ∈ Histsᵢ(A, S_init), out(ρ) = out(ρ′) implies σᵢ(ρ) = σᵢ(ρ′). A pure memoryless strategy of Pᵢ can be simply specified as a function Sᵢ → A.

A strategy σᵢ of Pᵢ on (A, S_init) is finite-memory if it can be encoded as a Mealy machine Γ = (M, α_nxt), with M = (M, m_init, α_upd) being a memory skeleton and α_nxt : Sᵢ × M → Dist(A) being the next-action function, which is such that for s ∈ Sᵢ, m ∈ M, Supp(α_nxt(s, m)) ⊆ A(s). Strategy σᵢ is encoded by Γ if for all histories ρ ∈ Histsᵢ(A, S_init), σᵢ(ρ) = α_nxt(out(ρ), α̂_upd(m_init, ĉol(ρ))). If σᵢ can be encoded as a Mealy machine (M, α_nxt), we say that σᵢ is based on (memory) M. If σᵢ is based on M and is pure, then the next-action function can be specified as a function Sᵢ × M → A.

Memoryless strategies correspond to finite-memory strategies based on the trivial memory skeleton M_triv = ({m_init}, m_init, (m_init, c) ↦ m_init) that has a single state. We denote by Σ^PFM_i(A, S_init) (resp.
Σ^P_i(A, S_init), Σ^GFM_i(A, S_init), Σ^G_i(A, S_init)) the set of pure finite-memory (resp. pure, finite-memory, general) strategies of Pᵢ on (A, S_init). A type of strategies is an element X ∈ {PFM, P, GFM, G} corresponding to these subsets.

Outcomes.
Let (A = (S₁, S₂, A, δ, col), S_init) be an initialized arena. For ρ ∈ Hists(A, S_init), we denote Cyl(ρ) = {π ∈ Plays(A, S_init) | ρ is a prefix of π} for the cylinder of ρ, that is, the set of plays (which are infinite) starting with ρ. We denote by F_(A,S_init) the smallest σ-algebra generated by all the cylinders of histories in Hists(A, S_init). Hence, (Plays(A, S_init), F_(A,S_init)) is a measurable space.

When both players have decided on a strategy and an initial state has been chosen, the generated object is a (finite or countably infinite) Markov chain, which induces a probability distribution on the plays. More precisely, for strategies σ₁ of P₁ and σ₂ of P₂ on (A, S_init) and s ∈ S_init, we denote P^{σ₁,σ₂}_{A,s} for the probability distribution on (Plays(A, s), F_(A,s)) induced by σ₁ and σ₂, starting from state s. This distribution is defined on the set of cylinders as follows: for ρ = s₀a₁s₁...aₙsₙ ∈ Hists(A, s),

P^{σ₁,σ₂}_{A,s}[Cyl(ρ)] = ∏_{j=1}^{n} σ_{iⱼ}(s₀...aⱼ₋₁sⱼ₋₁)(aⱼ) · δ(sⱼ₋₁, aⱼ, sⱼ),

where iⱼ = 1 if sⱼ₋₁ ∈ S₁, and iⱼ = 2 if sⱼ₋₁ ∈ S₂. This pre-measure can be uniquely extended to (Plays(A, s), F_(A,s)) by Carathéodory's extension theorem [Dur19, Theorem A.1.3], as the class of cylinders is a semi-ring of sets that generates the whole σ-algebra.

Similarly, we define F to be the smallest σ-algebra on C^ω generated by the set of all cylinders on C. We can extend ĉol to distributions over (Plays(A, s), F_(A,s)): for µ ∈ Dist(Plays(A, s), F_(A,s)), we write ĉol(µ) for the distribution µ ◦ ĉol⁻¹ ∈ Dist(C^ω, F). In particular, every probability distribution P^{σ₁,σ₂}_{A,s} naturally induces a probability distribution ĉol(P^{σ₁,σ₂}_{A,s}) over (C^ω, F) through the ĉol function, which we denote P̂^{σ₁,σ₂}_{A,s}.

Preferences.
To specify each player's objective, or preference, we use the general notion of preference relation.

Definition 7 (Preference relation). A preference relation ⊑ (on C) is a total preorder over Dist(C^ω, F).

P₁ favors the distributions in Dist(C^ω, F) that are the largest for ⊑, and as we are studying zero-sum games, P₂ favors the distributions that are the smallest for ⊑. Equivalently, P₂'s goal is to obtain the largest distribution for the inverse preference relation ⊑⁻¹, defined as µ ⊑⁻¹ µ′ if and only if µ′ ⊑ µ. For ⊑ a preference relation and µ, µ′ ∈ Dist(C^ω, F), we write µ ⊏ µ′ if µ ⊑ µ′ and not µ′ ⊑ µ.

Depending on the context, it might not be necessary to define a preference relation as total: it is sufficient to order distributions that can arise as an element P̂^{σ₁,σ₂}_{A,s} in the context. For example, in the specific case of deterministic games in which only pure strategies are considered, all distributions that arise are always Dirac distributions on a single infinite word in C^ω. In this context, it is therefore sufficient to define a total preorder over all Dirac distributions (which we can then see as infinite words, giving a definition of preference relation similar to [GZ05,BLO+20]).

Example 8.
We give three examples corresponding to three different ways to encode preference relations. First, a preference relation can be induced by an event W ∈ F called a winning condition, which consists of infinite sequences of colors. The objective of P₁ is to maximize the probability that the event W happens. An event W naturally induces a preference relation ⊑_W such that for µ, µ′ ∈ Dist(C^ω, F), µ ⊑_W µ′ if and only if µ(W) ≤ µ′(W). For C = ℕ, we give the example of the weak parity winning condition W_wp [Tho08], defined as

W_wp = {c₁c₂... ∈ C^ω | max_{j≥1} cⱼ exists and is even}.

In finite arenas, the value max_{j≥1} cⱼ always exists, as there are only finitely many colors that appear. This is different from the classical parity condition, which requires the maximal color seen infinitely often to be even, and not just the maximal color seen. In particular, W_wp is not prefix-independent.

A preference relation can also be induced by a Borel (real) payoff function f : C^ω → ℝ. For example, if C = ℚ and λ ∈ (0, 1) ∩ ℚ, a classical payoff function [Sha53] is the discounted sum Disc_λ, defined for c₁c₂... ∈ C^ω as

Disc_λ(c₁c₂...) = lim_{n→∞} ∑_{i=0}^{n} λ^i · c_{i+1}.

The goal of P₁ is to maximize the expected value of f, which is defined for a probability distribution µ ∈ Dist(C^ω, F) as E_µ[f] = ∫ f dµ. A payoff function f naturally induces a preference relation ⊑_f: for µ₁, µ₂ ∈ Dist(C^ω, F), µ₁ ⊑_f µ₂ if and only if E_{µ₁}[f] ≤ E_{µ₂}[f]. Payoff functions are more general than winning conditions: for W a winning condition, the preference relation induced by the indicator function of W, which is a payoff function, corresponds to the preference relation induced by W.

It is also possible to specify preference relations that cannot be expressed as a payoff function. An example is given in [CFK+...], where the goal of P₁ is to see color c ∈ C with probability precisely 1/2. We denote the event of seeing color c as ♦c ∈ F.
Then for µ, µ′ ∈ Dist(C^ω, F), if µ′(♦c) = 1/2 and µ(♦c) ≠ 1/2, then µ ⊏ µ′. ⊳

Combining an initialized arena, describing how the players interact with each other, and a preference relation, describing both players' objectives, defines an initialized game.

Definition 9 (Initialized game).
A (two-player stochastic turn-based zero-sum) initialized game is a tuple G = (A, S_init, ⊑), where (A, S_init) is an initialized arena and ⊑ is a preference relation.

Optimality of strategies. Let G = (A, S_init, ⊑) be an initialized game and X ∈ {PFM, P, GFM, G} be a type of strategies. For s ∈ S_init, σ₁ ∈ Σ^X_1(A, S_init), we define

UCol^X_⊑(A, s, σ₁) = {µ ∈ Dist(C^ω, F) | ∃σ₂ ∈ Σ^X_2(A, s), P̂^{σ₁,σ₂}_{A,s} ⊑ µ}.

UCol^X_⊑(A, s, σ₁) corresponds to all the distributions that are at least as good for ⊑ as a distribution that P₂ can induce by playing a strategy σ₂ of type X against σ₁; this set is upward-closed w.r.t. ⊑. We can define a similar operator for strategies of P₂: for s ∈ S_init, σ₂ ∈ Σ^X_2(A, S_init),

DCol^X_⊑(A, s, σ₂) = {µ ∈ Dist(C^ω, F) | ∃σ₁ ∈ Σ^X_1(A, s), µ ⊑ P̂^{σ₁,σ₂}_{A,s}}.

For σ₁, σ′₁ ∈ Σ^X_1(A, S_init), we say that σ₁ is at least as good as σ′₁ from s ∈ S_init under X strategies if

UCol^X_⊑(A, s, σ₁) ⊆ UCol^X_⊑(A, s, σ′₁).

This inclusion means that the best replies of P₂ against σ′₁ yield an outcome that is at least as bad (w.r.t. ⊑) as the best replies of P₂ against σ₁. Symmetrically, for σ₂, σ′₂ ∈ Σ^X_2(A, S_init), we say that σ₂ is at least as good as σ′₂ from s ∈ S_init under X strategies if

DCol^X_⊑(A, s, σ₂) ⊆ DCol^X_⊑(A, s, σ′₂).

Definition 10 (Optimal strategy).
Let G = (A, S_init, ⊑) be an initialized game and X ∈ {PFM, P, GFM, G} be a type of strategies. A strategy σᵢ ∈ Σ^X_i(A, S_init) is X-optimal in G if it is at least as good under X strategies as any other strategy in Σ^X_i(A, S_init) from all s ∈ S_init.

When the considered preference relation ⊑ is clear from the context, we often talk about X-optimality in an initialized arena (A, S_init) to refer to X-optimality in the initialized game (A, S_init, ⊑). Notice that given two isomorphic initialized arenas, there is an obvious bijection between the strategies on them, and the properties of the strategies (pure, memoryless, finite-memory, X-optimal...) are preserved through this bijection.

Our goal will be to understand, given a preference relation, a class of arenas, and a type of strategies, what kinds of strategies are sufficient to play optimally. In the following definition, abbreviations AIFM and FM stand respectively for arena-independent finite-memory and finite-memory.

Definition 11 (Sufficiency of strategies).
Let ⊑ be a preference relation, A be a class of initialized arenas, X ∈ {PFM, P, GFM, G} be a type of strategies, and M be a memory skeleton.
– We say that pure memoryless strategies (resp. pure strategies based on M) suffice to play X-optimally in A for Pᵢ if for all (A, S_init) ∈ A, Pᵢ has a pure memoryless strategy (resp. a pure strategy based on M) that is X-optimal in (A, S_init).
– We say that pure AIFM strategies suffice to play X-optimally in A for Pᵢ if there exists a memory skeleton M such that pure strategies based on M suffice to play X-optimally in A for Pᵢ.
– We say that pure FM strategies suffice to play X-optimally in A for Pᵢ if for all (A, S_init) ∈ A, there exists a memory skeleton M such that Pᵢ has a pure strategy based on M that is X-optimal in (A, S_init).

If A is clear in the context (typically all initialized deterministic or stochastic arenas), we often omit it. When no type of strategies is specified, it means that we consider optimality against all (general) strategies.

Since memoryless strategies are a specific kind of finite-memory strategies based on the memory skeleton M_triv, the sufficiency of pure memoryless strategies is equivalent to the sufficiency of pure strategies based on M_triv, and is therefore just a specific case of the sufficiency of pure AIFM strategies. Notice the difference between the order of quantifiers for AIFM and FM strategies: the sufficiency of pure AIFM strategies implies the sufficiency of pure FM strategies, but the opposite is false, as we show in the following example.

Example 12.
Let us consider the energy parity winning condition studied in deterministic arenas in [CD12]. We do not explain this winning condition in detail, but comment on its memory requirements and how it illustrates the difference between AIFM and FM strategies. For this objective, in deterministic arenas, pure memoryless strategies suffice to play optimally for P₂. On the other hand, P₁ can play optimally with finite memory, but the memory needed depends on the number of states of the arena; it is not arena-independent. Therefore, pure FM strategies suffice in deterministic games for P₁, but not pure AIFM strategies: P₁ needs to change its memory skeleton depending on the arena, and no single memory skeleton is sufficient to play optimally in all deterministic arenas (even with a fixed and finite number of colors).

Interestingly, it is shown in [MSTW17] that the same winning condition needs infinite memory in (even one-player) stochastic arenas for P₁, which shows that the sufficiency of pure FM strategies in deterministic arenas does not imply the sufficiency of pure FM strategies in stochastic arenas.

Now let us reconsider the weak parity winning condition W_wp introduced in Example 8: the goal of P₁ is to maximize the probability that the greatest color seen is even. As will be proven formally in Section 6 thanks to the results of this article, to play optimally in any stochastic game, it is sufficient for both players to remember the greatest color already seen, which can be implemented by the memory skeleton M_max = (ℕ, 0, (m, n) ↦ max{m, n}). As explained above, this memory skeleton has an infinite state space, but as there are only finitely many colors in every (finite) arena, only a finite part of the skeleton is sufficient to play optimally in a given arena. The size of the skeleton used for a fixed arena depends on the appearing colors, but for a fixed number of colors, it does not depend on parameters of the arena (such as its state and action spaces).
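As a toy illustration (our own construction, not from the paper), the following Python sketch plays weak parity in a trivial one-player arena with a pure strategy based on M_max: the next-action function consults only the current state and the remembered maximum.

```python
# Toy one-player arena (names ours): a single state "s" with two actions.
# Action "odd" emits color 1, action "even" emits color 2; both loop on "s".
colors = {"odd": 1, "even": 2}

def upd(m, c):
    """Update of M_max: remember the greatest color seen so far."""
    return max(m, c)

def nxt(state, m):
    """Pure next-action function based on M_max: pick an action whose
    color keeps (or makes) the running maximum even, if possible."""
    for a, c in colors.items():
        if max(m, c) % 2 == 0:
            return a
    return next(iter(colors))  # fallback: no action yields an even maximum

# Simulate the Mealy machine for a few steps.
m, seen = 0, []
for _ in range(5):
    a = nxt("s", m)
    m = upd(m, colors[a])
    seen.append(colors[a])
assert max(seen) % 2 == 0  # the greatest color of the induced play is even
```

Note that once the remembered maximum is even, any action whose color does not exceed it is acceptable: here the strategy first plays "even" (raising the maximum to 2), after which playing "odd" no longer endangers the condition.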
Therefore pure AIFM strategies suffice to play optimally for both players, and more precisely pure strategies based on M_max suffice for both players. ⊳ We define a second, stronger notion related to optimality of strategies, which is the notion of subgame perfect strategy: a strategy is subgame perfect in a game if it reacts optimally to all histories consistent with the arena, even histories not consistent with the strategy itself, or histories that only a non-rational adversary would play. To do so, we first need an extra definition.
Definition 13 (Shifted distributions, strategies and preference relations).
For w ∈ C∗ and µ ∈ Dist ( C^ω , F ) , we define the shifted distribution wµ as the distribution such that for an event E ∈ F , wµ ( E ) = µ ( { w′ ∈ C^ω | ww′ ∈ E } ) . Let ( A , S init ) be an arena, and σ_i ∈ Σ^G_i ( A , S init ) . For ρ = s_0 a_1 s_1 . . . a_n s_n ∈ Hists ( A , S init ) , we define the shifted strategy σ_i [ ρ ] ∈ Σ^G_i ( A , out ( ρ )) as the strategy such that, for ρ′ = s_n a_{n+1} s_{n+1} . . . a_m s_m ∈ Hists_i ( A , out ( ρ )) , σ_i [ ρ ]( ρ′ ) = σ_i ( s_0 a_1 s_1 . . . a_m s_m ) . For ⊑ a preference relation and w ∈ C∗ , we define the shifted preference relation ⊑ [ w ] as the preference relation such that for µ, µ′ ∈ Dist ( C^ω , F ) , µ ⊑ [ w ] µ′ if and only if wµ ⊑ wµ′ . Definition 14 (Subgame perfect strategy).
Let G = ( A , S init , ⊑ ) be an initialized game and X ∈ { PFM , P , GFM , G } be a type of strategies. A strategy σ_i ∈ Σ^X_i ( A , S init ) is X -subgame perfect ( X -SP) in G if for all ρ ∈ Hists ( A , S init ) , the shifted strategy σ_i [ ρ ] is X -optimal in the initialized game ( A , out ( ρ ) , ⊑ [ c_col ( ρ )] ) . Strategies that are X -SP are in particular X -optimal; the converse is not true in general. For technical reasons, we will use the notion of equilibrium as a tool in proofs to show the existence of optimal strategies. Definition 15 (Nash, SP equilibrium).
Let G = ( A , S init , ⊑ ) be an initialized game and X ∈ { PFM , P , GFM , G } be a type of strategies. A pair of strategies ( σ_1 , σ_2 ) ∈ Σ^X_1 ( A , S init ) × Σ^X_2 ( A , S init ) is an X -Nash equilibrium ( X -NE) in G if for all σ′_1 ∈ Σ^X_1 ( A , S init ) , for all σ′_2 ∈ Σ^X_2 ( A , S init ) , for all s ∈ S init ,

Pc^{σ′_1,σ_2}_{A,s} ⊑ Pc^{σ_1,σ_2}_{A,s} ⊑ Pc^{σ_1,σ′_2}_{A,s} .

We say that ( σ_1 , σ_2 ) is an X -subgame perfect equilibrium ( X -SPE) if for all ρ ∈ Hists ( A , S init ) , the pair of strategies ( σ_1 [ ρ ] , σ_2 [ ρ ]) is an X -Nash equilibrium in ( A , out ( ρ ) , ⊑ [ c_col ( ρ )] ) . (Note that work on deterministic arenas only considers P -optimality, but as pure strategies suffice for Borel objectives [Mar75], this implies G -optimality.) A pair ( σ_1 , σ_2 ) is thus an X -NE if no player has any interest in unilaterally deviating from its strategy (using strategies of type X ), as the induced probability distribution on the colors (or equivalently, the induced Markov chain) would not be better for this player than not changing the strategy. In the zero-sum context, if ( σ_1 , σ_2 ) is an X -NE (resp. X -SPE), then both σ_1 and σ_2 are X -optimal (resp. X -SP). We say that a pair of strategies ( σ_1 , σ_2 ) is pure (resp. randomized, memoryless, based on M ) if both σ_1 and σ_2 are pure (resp. randomized, memoryless, based on M ). Thanks to the fact that we consider zero-sum games, we can use the following handy result about Nash equilibria. Lemma 16.
Let G = ( A , S init , ⊑ ) be a game and X ∈ { PFM , P , GFM , G } be a type of strategies. Let ( σ^a_1 , σ^a_2 ) , ( σ^b_1 , σ^b_2 ) ∈ Σ^X_1 ( A , S init ) × Σ^X_2 ( A , S init ) be two X -NE (resp. X -SPE) in G . Then ( σ^a_1 , σ^b_2 ) is also an X -NE (resp. X -SPE) in G . Proof. A very similar proof appears in [BLO +
20, Lemma 2]. We do the proof for NE, and the result about SPE follows since its definition uses the notion of NE. We need to prove that for all σ′_1 ∈ Σ^X_1 ( A , S init ), for all σ′_2 ∈ Σ^X_2 ( A , S init ), for all s ∈ S init ,

Pc^{σ′_1,σ^b_2}_{A,s} ⊑ Pc^{σ^a_1,σ^b_2}_{A,s} ⊑ Pc^{σ^a_1,σ′_2}_{A,s} . (1)

Since ( σ^a_1 , σ^a_2 ) is an X -NE, we know that

Pc^{σ^b_1,σ^a_2}_{A,s} ⊑ Pc^{σ^a_1,σ^a_2}_{A,s} ⊑ Pc^{σ^a_1,σ^b_2}_{A,s},

instantiating σ′_1 and σ′_2 as σ^b_1 and σ^b_2 in the definition of X -NE. Similarly, since ( σ^b_1 , σ^b_2 ) is an X -NE, we know that

Pc^{σ^a_1,σ^b_2}_{A,s} ⊑ Pc^{σ^b_1,σ^b_2}_{A,s} ⊑ Pc^{σ^b_1,σ^a_2}_{A,s},

instantiating σ′_1 and σ′_2 as σ^a_1 and σ^a_2 in the definition of X -NE. One can see from the last two lines that all six probability distributions over sequences of colors are equivalent w.r.t. ⊑ as the inequalities form a cycle. Now, let σ′_1 ∈ Σ^X_1 ( A , S init ) and σ′_2 ∈ Σ^X_2 ( A , S init ). Since ( σ^a_1 , σ^a_2 ) and ( σ^b_1 , σ^b_2 ) are both X -NE, and since Pc^{σ^a_1,σ^b_2}_{A,s} is equivalent w.r.t. ⊑ to both Pc^{σ^a_1,σ^a_2}_{A,s} and Pc^{σ^b_1,σ^b_2}_{A,s}, we obtain

Pc^{σ′_1,σ^b_2}_{A,s} ⊑ Pc^{σ^b_1,σ^b_2}_{A,s} ⊑ Pc^{σ^a_1,σ^b_2}_{A,s} ⊑ Pc^{σ^a_1,σ^a_2}_{A,s} ⊑ Pc^{σ^a_1,σ′_2}_{A,s},

thus (1) is verified. ⊓⊔

Operations on arenas.
We introduce two operations on arenas that we will use multiple times through the course of this article. For A an arena, w ∈ C∗, and s a state of A , we write

A_{w→s} (2)

for the prefix-extended arena that consists of arena A with an extra "chain" of states leading up to s with the same colors as w . Formally, if A = ( S_1 , S_2 , A, δ, col ), w = c_1 c_2 . . . c_n , and s ∈ S , we define A_{w→s} as the arena ( S′_1 , S′_2 , A′ , δ′ , col′ ) where S′_1 = S_1 ⊎ { s^w_0 , . . . , s^w_{n−1} } and S′_2 = S_2 ; A′ = A ⊎ { a_w } , A′|_S = A , and for i , 0 ≤ i ≤ n − 1, A′ ( s^w_i ) = { a_w } ; δ′|_{S×A} = δ , for i , 0 ≤ i < n − 1, δ′ ( s^w_i , a_w , s^w_{i+1} ) = 1, and δ′ ( s^w_{n−1} , a_w , s ) = 1; col′|_{S×A} = col , and for i , 0 ≤ i ≤ n − 1, col′ ( s^w_i , a_w ) = c_{i+1} . If µ = Pc^{σ_1,σ_2}_{A,s} is induced by strategies σ_1 and σ_2 on some initialized arena ( A , s ), notice that the shifted distribution wµ equals Pc^{σ′_1,σ′_2}_{A_{w→s},s^w_0}, where σ′_1 plays the only available action a_w until it reaches s , and then σ′_1 and σ′_2 play like σ_1 and σ_2 , ignoring they ever saw w . For two arenas A_1 and A_2 with disjoint state spaces, if s_1 and s_2 are two states controlled by P_1 that are respectively in A_1 and A_2 with disjoint sets of available actions, we write

( A_1 , s_1 ) ⊔ ( A_2 , s_2 ) (3)

for the merged arena in which s_1 and s_2 are merged, and everything else is kept the same. The merged state which comes from the merge of s_1 and s_2 is usually called t . Formally, let A_1 = ( S^1_1 , S^1_2 , A_1 , δ_1 , col_1 ), A_2 = ( S^2_1 , S^2_2 , A_2 , δ_2 , col_2 ), s_1 ∈ S^1_1 , and s_2 ∈ S^2_1 .
We assume that S^1 ∩ S^2 = ∅ and that A_1 ( s_1 ) ∩ A_2 ( s_2 ) = ∅ . We define ( A_1 , s_1 ) ⊔ ( A_2 , s_2 ) as the arena ( S_1 , S_2 , A, δ, col ) with S_1 = S^1_1 ⊎ S^2_1 ⊎ { t } \ { s_1 , s_2 } and S_2 = S^1_2 ⊎ S^2_2 ; A ( t ) = A_1 ( s_1 ) ⊎ A_2 ( s_2 ) and all the other available actions are kept the same as in the original arenas; for i ∈ { 1 , 2 } , δ ( t, a ) = δ_i ( s_i , a ) if a ∈ A_i ( s_i ) and all the other transitions are kept the same as in the original arenas (with transitions going to s_1 or s_2 being redirected to t ); for i ∈ { 1 , 2 } , col ( t, a ) = col_i ( s_i , a ) if a ∈ A_i ( s_i ) and all the other colors are kept the same as in the original arenas. A symmetrical definition can be written if s_1 and s_2 are both controlled by P_2 . In practice, we often consider classes A of initialized arenas that are closed with respect to some operation — we specify the exact meaning for each operation we will use here:
– for M a memory skeleton, A is closed by product with M if for all ( A , S init ) ∈ A , ( A , S init ) ⋉ M ∈ A ;
– A is closed by prefix-extension if for all ( A , S init ) ∈ A , for all w ∈ C∗ , for all states s of A , ( A_{w→s} , S init ∪ { s^w_0 } ) ∈ A .
Standard classes of arenas are all closed by these operations: we give as examples the classes of all initialized one-player deterministic arenas of P_1 , one-player stochastic arenas of P_1 , two-player deterministic arenas, and two-player stochastic arenas (corresponding respectively to the classes of 1-player, 1½-player, 2-player, and 2½-player arenas often found in the literature). Throughout the article, we state results with the exact required closure properties for generality, but most applications use such standard classes. In this section, we establish a few key results about memory and playing optimally. The main tool is given by Lemma 21, which can be used to reduce questions about the sufficiency of AIFM strategies in reasonable classes of initialized arenas to the sufficiency of memoryless strategies in a subclass.
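As an aside, the prefix-extension operation (2) introduced above is easy to make concrete. The following Python sketch uses a deliberately bare-bones arena encoding (dicts keyed by (state, action) pairs); the encoding and all names are ours, not the paper's.

```python
# Sketch of the prefix-extension of (2): a fresh chain of states reading the
# color word w is plugged before state s, with a single action a_w available
# along the chain. Probabilistic transitions are dicts {successor: probability}.

def prefix_extend(arena, w, s):
    """Return a copy of `arena` with a chain reading the color word w into s."""
    delta = dict(arena["delta"])   # (state, action) -> {successor: probability}
    col = dict(arena["col"])       # (state, action) -> color
    chain = [f"s_w{i}" for i in range(len(w))]
    for i, c in enumerate(w):
        nxt = chain[i + 1] if i + 1 < len(w) else s
        delta[(chain[i], "a_w")] = {nxt: 1.0}   # deterministic chain step
        col[(chain[i], "a_w")] = c
    return {"states": arena["states"] + chain, "delta": delta, "col": col}

arena = {"states": ["s", "t"],
         "delta": {("s", "a"): {"t": 1.0}},
         "col": {("s", "a"): 1}}
ext = prefix_extend(arena, [0, 1], "s")
print(ext["delta"][("s_w1", "a_w")])  # -> {'s': 1.0}
```

The original arena is untouched inside the extension, matching the definition: δ′ and col′ restricted to S × A coincide with δ and col.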
We end the section by showing the use of Lemma 21 in the proof of our first main result (Theorem 22), which shows that the sufficiency of pure AIFM strategies implies the stronger existence of pure AIFM SP strategies in well-behaved classes of initialized arenas. First, we restate the classical result linking playing optimally with memory M in an initialized arena and playing optimally with a memoryless strategy in its product with M . Lemma 17.
Let ⊑ be a preference relation, X ∈ { PFM , P , GFM , G } be a type of strategies, and M = ( M, m_init , α_upd ) be a memory skeleton. Let G = ( A , S init , ⊑ ) be an initialized game, and let σ_i ∈ Σ^X_i ( A , S init ) be a finite-memory strategy encoded by a Mealy machine Γ = ( M , α_nxt ) . Then, σ_i is X -optimal in G if and only if α_nxt corresponds to a memoryless X -optimal strategy in the game G′ = (( A , S init ) ⋉ M , ⊑ ) . A proof of a very similar result can be found in [BLO +
20, Lemma 1]. This lemma can be restated for SP strategies, for NE, and for SPE with a very similar proof.
Proof.
This proof goes through multiple steps, which all rely on establishing a correspondence between properties of ( A , S init ) and ( A , S init ) ⋉ M . We first establish a bijection H between their finite histories and then a bijection f between their strategies. This is sufficient to show that the UCol^X_⊑ operators are preserved through f , which shows that X -optimality is preserved through f . It is then left to show that f ( σ_i ) corresponds to α_nxt . We first define a bijection H : Hists ( A , S init ) → Hists (( A , S init ) ⋉ M ). Let ρ = s_0 a_1 s_1 . . . a_n s_n ∈ Hists ( A , S init ). We set m_0 = m_init , and for 1 ≤ j ≤ n , m_j = α_upd ( m_{j−1} , col ( s_{j−1} , a_j )). We define H ( ρ ) = ( s_0 , m_0 ) a_1 ( s_1 , m_1 ) . . . a_n ( s_n , m_n ). Notice that c_col ( H ( ρ )) = c_col ( ρ ). Furthermore, H is bijective: as the initial state of the memory m_init is fixed and the memory skeleton is deterministic, the memory states added to ρ to obtain H ( ρ ) are uniquely determined. We now show that there is a correspondence between strategies of Σ^X_i ( A , S init ) and strategies of Σ^X_i (( A , S init ) ⋉ M ): intuitively, augmenting the arena with the skeleton allows some strategies to be played using less memory, but does not fundamentally change each player's possibilities. We define a function f : Σ^X_i ( A , S init ) → Σ^X_i (( A , S init ) ⋉ M ). For τ_i ∈ Σ^G_i ( A , S init ) and ρ′ ∈ Hists_i (( A , S init ) ⋉ M ), we define f ( τ_i )( ρ′ ) = τ_i ( H^{−1} ( ρ′ )) (exploiting that actions are the same in ( A , S init ) as in ( A , S init ) ⋉ M ). Function f is bijective (for τ′_i ∈ Σ^G_i (( A , S init ) ⋉ M ), its inverse can be specified as f^{−1} ( τ′_i ) = τ′_i ◦ H ). Moreover, it preserves the pure/randomized and the finite-memory/infinite-memory features of the strategies. We observe the following fact about f : for all s ∈ S init , for all τ_1 ∈ Σ^X_1 ( A , S init ) and τ_2 ∈ Σ^X_2 ( A , S init ),

Pc^{τ_1,τ_2}_{A,s} = Pc^{f(τ_1),f(τ_2)}_{A⋉M,(s,m_init)}.
(4) It can easily be proven by induction that these probability distributions match on all cylinders (as they always induce the same distributions on the actions and on the colors after corresponding histories ρ and H ( ρ )), hence they are equal. (Remember that our preference relation is defined over distributions over sequences of colors, and Equation (4) compares two such distributions.) Now let τ_1 ∈ Σ^X_1 ( A , S init ). We notice that for all s ∈ S init ,

UCol^X_⊑ ( A , s, τ_1 ) = { µ ∈ Dist ( C^ω , F ) | ∃ τ_2 ∈ Σ^X_2 ( A , S init ) , Pc^{τ_1,τ_2}_{A,s} ⊑ µ }
= { µ ∈ Dist ( C^ω , F ) | ∃ τ_2 ∈ Σ^X_2 ( A , S init ) , Pc^{f(τ_1),f(τ_2)}_{A⋉M,(s,m_init)} ⊑ µ } (by (4))
= { µ ∈ Dist ( C^ω , F ) | ∃ τ′_2 ∈ Σ^X_2 (( A , S init ) ⋉ M ) , Pc^{f(τ_1),τ′_2}_{A⋉M,(s,m_init)} ⊑ µ }
= UCol^X_⊑ (( A , S init ) ⋉ M , ( s, m_init ) , f ( τ_1 )) ,

where the penultimate line holds by bijectivity of f . The property holds symmetrically for a strategy τ_2 ∈ Σ^X_2 ( A , S init ). Thus X -optimality of strategies is preserved through f . Now remember that σ_i ∈ Σ^X_i ( A , S init ) is a strategy encoded by a Mealy machine ( M , α_nxt ). We notice that f ( σ_i ) corresponds to α_nxt interpreted over the product initialized arena and is thus memoryless. By the previous property, we have that σ_i is X -optimal in G if and only if α_nxt is X -optimal in G′ . ⊓⊔ We now define a property of initialized arenas called coverability by M (for a memory skeleton M ), which happens to characterize initialized arenas that are a product with M (Lemma 19). Albeit intuitive, this is a key technical step, as the class of arenas covered by a memory skeleton is sufficiently well-behaved to support edge-induction arguments, whereas it is more difficult to perform such techniques directly on the class of product arenas: removing a single edge from a product arena makes it hard to express as a product arena, whereas it is clear that coverability is preserved. Definition 18 (Coverability by M ).
An initialized arena (( S_1 , S_2 , A, δ, col ) , S init ) is covered by memory skeleton M = ( M, m_init , α_upd ) if there exists a function φ : S → M such that for all s ∈ S init , φ ( s ) = m_init , and for all ( s, a, s′ ) ∈ δ , α_upd ( φ ( s ) , col ( s, a )) = φ ( s′ ) . This property means that it is possible to assign a unique memory state to each arena state such that transitions of the arena always update the memory state in a way that is consistent with the memory skeleton. Note that isomorphism of initialized arenas preserves coverability by any memory skeleton. Also, every initialized arena is covered by M_triv , which is witnessed by the constant function φ associating m_init to every state. Definitions close to our notion of coverability by M were introduced for deterministic arenas in [Kop08,BLO+20]. The notion of adherence with M in [Kop08, Definition 8.12] is very similar, but does not distinguish initial states from the rest (neither in the arena nor in the memory skeleton) — the reason is that [Kop08] only considers prefix-independent objectives, for which selecting the right initial memory state is not as important (see [Kop08, Proposition 8.2]). Our property of ( A , S init ) being covered by M is also equivalent to A being both
20] as they are used at different places in proofs ( prefix-covered alongwith monotony , and cyclic-covered along with selectivity ). Here, we opt for a single concise definition,as most of our proofs do not mention monotony and selectivity .We link products and coverability: first, product initialized arenas are covered; second, coveredinitialized arenas are exactly the ones that are isomorphic to their own product.
Lemma 19.
Let M be a memory skeleton and ( A , S init ) be an initialized arena. The product initialized arena ( A , S init ) ⋉ M is covered by M . Moreover, ( A , S init ) is covered by M if and only if ( A , S init ) is isomorphic to ( A , S init ) ⋉ M . Proof. Let M = ( M, m_init , α_upd ) and A = ( S_1 , S_2 , A, δ, col ). We show that ( A , S init ) ⋉ M is covered by M . Let φ : S × M → M be the projection on M (that is, φ ( s, m ) = m for all ( s, m ) ∈ S × M ). This function witnesses that ( A , S init ) ⋉ M is covered, by definition of the product initialized arena: only transitions that are consistent with the memory structure are allowed. Assume now ( A , S init ) is covered by M , witnessed by function φ . We show that ( A , S init ) is isomorphic to ( A , S init ) ⋉ M . In ( A , S init ) ⋉ M , it is not possible to reach two states ( s, m ) and ( s, m′ ) with m ≠ m′ from S init × { m_init } : otherwise, this would contradict that ( A , S init ) is covered by M . Hence the function ψ_S : s ↦ ( s, φ ( s )) is a bijection between states of ( A , S init ) and states of ( A , S init ) ⋉ M (remember that we only keep the reachable states of the product initialized arena). Moreover, all the actions, transitions and colors are preserved, by definition of the product initialized arena. Hence ( A , S init ) is isomorphic to its own product with M . Now for the other direction, assume ( A , S init ) is isomorphic to ( A , S init ) ⋉ M . Thus, ( A , S init ) can be expressed as a product with M and is thus covered by M by the first claim. ⊓⊔ This last lemma shows in some sense an equivalence between a product and a covered initialized arena: product initialized arenas are covered (a similar result for non-initialized arenas is discussed in [BLO +
20, Lemma 8]) and conversely, covered initialized arenas can be written as a product. The latter implication requires the use of initialized arenas to be expressed in a concise way: if a game could always start from any state of an arena, taking the product with a memory skeleton would virtually always make the arena grow, and it could therefore not be isomorphic to its own product. That is one of the main reasons we resort to initialized arenas: we are therefore able to talk interchangeably about being a product , which is a technical property at the core of the idea of playing with memory , and about coverability , which is a more intuitive, easy-to-check condition that trivially benefits from nice closure properties. We establish two easy consequences of the previous lemmas to have a better understanding of the links between covered and product initialized arenas.
Corollary 20.
Let M_1 and M_2 be two memory skeletons. An initialized arena ( A , S init ) is covered by M_1 and by M_2 if and only if it is covered by M_1 ⊗ M_2 . Proof. Let ( A , S init ) be covered by M_1 and by M_2 . It is thus isomorphic to (( A , S init ) ⋉ M_1 ) ⋉ M_2 by applying Lemma 19 twice. Notice that (( A , S init ) ⋉ M_1 ) ⋉ M_2 is isomorphic to ( A , S init ) ⋉ ( M_1 ⊗ M_2 ) (simply consider the bijection ψ_S : (( s, m_1 ) , m_2 ) ↦ ( s, ( m_1 , m_2 )) ). Hence, ( A , S init ) is isomorphic to ( A , S init ) ⋉ ( M_1 ⊗ M_2 ), and by using Lemma 19 in the other direction, is covered by M_1 ⊗ M_2 . Following the arguments backwards yields the other direction of the implication. ⊓⊔ The following lemma sums up our main practical use of the idea of coverability , by proving an equivalence between optimal strategies with memory in initialized arenas and memoryless optimal strategies in covered initialized arenas, in classes of arenas with mild hypotheses.
Lemma 21.
Let M be a memory skeleton, and X ∈ { PFM , P , GFM , G } be a type of strategies. Let A be a class of initialized arenas closed by product with M . Then, P has an X -optimal (resp. X -SP) strategy based on M in all initialized arenas in A if and only if P has a memoryless X -optimal (resp. X -SP) strategy in all initialized arenas covered by M in A . Also, there is an X -NE (resp. X -SPE) based on M in all initialized arenas in A if and only if there is a memoryless X -NE (resp. X -SPE) in all initialized arenas covered by M in A . Proof. We first prove that products of initialized arenas in A with M correspond exactly to initialized arenas of A covered by M , that is,

{ ( A , S init ) ⋉ M | ( A , S init ) ∈ A } = { ( A , S init ) ∈ A | ( A , S init ) is covered by M } . (5)

We start with the left-to-right inclusion. Let ( A , S init ) ⋉ M be an initialized product arena, with ( A , S init ) ∈ A . Then, ( A , S init ) ⋉ M belongs to A , as A is closed by product with M . Moreover, ( A , S init ) ⋉ M is covered by M by Lemma 19. For the right-to-left inclusion, let ( A , S init ) ∈ A be covered by M . Then, it is isomorphic to ( A , S init ) ⋉ M by Lemma 19. Therefore, it can be expressed as the product of an initialized arena of A with M . We now prove the main statements of the lemma. Player P has an X -optimal (resp. X -SP) strategy based on M in all initialized arenas of A if and only if P has a memoryless X -optimal (resp. X -SP) strategy in all products of initialized arenas in A with M (by Lemma 17) if and only if P has a memoryless X -optimal (resp. X -SP) strategy in all initialized arenas covered by M in A (by (5)). Similarly, there is an X -NE (resp. X -SPE) based on M in all initialized arenas in A if and only if there is a memoryless X -NE (resp. X -SPE) in all products of initialized arenas in A with M (by Lemma 17) if and only if there is a memoryless X -NE (resp. X -SPE) in all initialized arenas covered by M in A (by (5)).
⊓⊔ We conclude this section with one of our main results, which shows that when pure strategies based on the same memory skeleton M are sufficient to play optimally , then pure SP strategies based on M exist. Theorem 22.
Let M be a memory skeleton, and X ∈ { PFM , P , GFM , G } be a type of strategies. Let A be a class of initialized arenas closed by product with M and by prefix-extension. If P has pure X -optimal strategies based on M in all initialized arenas of A , then P has pure X -SP strategies based on M in all initialized arenas of A . If there exist pure X -NE based on M in all initialized arenas of A , then there exist pure X -SPE based on M in all initialized arenas of A . Proof. We start by proving the first claim (going from X -optimal to X -SP strategies). As A is closed by product with M , by Lemma 21, both the hypothesis and the thesis of this claim can be reformulated for pure memoryless strategies in initialized arenas covered by M . We thus prove equivalently that P has pure memoryless X -SP strategies in all initialized arenas covered by M in A , based on the hypothesis that P has pure memoryless X -optimal strategies in all initialized arenas covered by M in A . Let ( A_0 , S^0_init ) ∈ A be covered by M . By hypothesis, P has a pure memoryless X -optimal strategy σ_0 on ( A_0 , S^0_init ). If this strategy is X -SP, then we are done. If not, then there exists ρ_0 ∈ Hists ( A_0 , S^0_init ), with s_0 = out ( ρ_0 ) and w_0 = c_col ( ρ_0 ), such that σ_0 [ ρ_0 ] is not X -optimal in ( A_0 , s_0 , ⊑ [ w_0 ] ). We extend arena A_0 to a new prefix-extended arena A_1 = ( A_0 )_{w_0→s_0} (this notation was introduced at (2)) by "plugging" a copy of history ρ_0 before s_0 . We also fix S^1_init = S^0_init ⊎ { s^{w_0}_0 } , where s^{w_0}_0 is the first state of the newly added chain with the same colors as w_0 . The initialized arena ( A_1 , S^1_init ) is in A since A is closed by prefix-extension. We show that ( A_1 , S^1_init ) is covered by M : the covering property holds from S^0_init because ( A_0 , S^0_init ) was already covered by M and the newly added states are not reachable from S^0_init , and it holds from s^{w_0}_0 because the colors up to s_0 are the same as along history ρ_0 from S^0_init . By hypothesis, there exists a pure memoryless X -optimal strategy σ_1 on ( A_1 , S^1_init ). We argue that σ_1 is X -optimal in ( A_0 , s_0 , ⊑ [ w_0 ] ) (i.e., after seeing ρ_0 ); if it were not, then it would not be X -optimal from ( A_1 , s^{w_0}_0 , ⊑ ), as in both cases, the sequence of colors w_0 is seen before s_0 is reached, and the same (memoryless) strategy is played from s_0 . The restriction of strategy σ_1 to Hists ( A_0 , S^0_init ) is therefore also X -optimal in ( A_0 , S^0_init ), but it is better than σ_0 after seeing ρ_0 . If the restriction of σ_1 to Hists ( A_0 , S^0_init ) is X -SP in ( A_0 , S^0_init ), then we are done. If not, then some history ρ_1 ∈ Hists ( A_0 , S^0_init ) witnesses that this restriction is not X -SP in ( A_0 , S^0_init ). Let w_1 = c_col ( ρ_1 ) and s_1 = out ( ρ_1 ). We can keep going and build an arena A_2 = ( A_1 )_{w_1→s_1} , with initial states S^2_init = S^1_init ⊎ { s^{w_1}_0 } , which gives us a pure memoryless X -optimal strategy σ_2 on ( A_2 , S^2_init ). We keep building initialized arenas ( A_i , S^i_init ) and pure memoryless X -optimal strategies σ_i as long as the restrictions of the strategies to Hists ( A_0 , S^0_init ) are not X -SP in ( A_0 , S^0_init ). We argue that this iteration ends after a finite number of steps. The restriction of every strategy σ_i to Hists ( A_0 , S^0_init ) is necessarily different from the same restriction for all the previous strategies: for all j , 0 ≤ j < i , σ_i is better than σ_j after seeing history ρ_j , and can therefore not be equal to σ_j . Moreover, there are only finitely many pure memoryless strategies on ( A_0 , S^0_init ) (as this arena is finite), and there is a bijection between pure memoryless strategies of the arenas ( A_i , S^i_init ) and of the arena ( A_0 , S^0_init ) (as building prefix-extensions does not provide more choices for memoryless strategies). Combining that all strategies σ_i are different with the finiteness of the number of strategies shows that the iteration ends, and therefore that, for some i ≥ 0, the restriction of the pure memoryless strategy σ_i to Hists ( A_0 , S^0_init ) is X -SP in ( A_0 , S^0_init ). The proof to go from pure X -NE based on M to pure X -SPE based on M works in the same way, as there are also finitely many pairs of pure memoryless strategies. ⊓⊔ The facts that we consider finite arenas and that the hypothesis is about pure AIFM strategies are both crucial in the previous proof, as we need the finiteness of the type of strategies considered. This result shows a major distinction between the sufficiency of AIFM strategies and the more general sufficiency of FM strategies: if a player can always play optimally with the same memory, then SP strategies may be played with the same memory as optimal strategies — if a player can play optimally but needs arena-dependent finite memory, then infinite memory may still be required to obtain SP strategies. One such example is provided in [LPR18, Example 16] for the average-energy games with lower-bounded energy objective in deterministic games: P can always play optimally with pure finite-memory strategies [BHM +
17, Theorem 13], but infinite memory is needed for SP strategies.
Our goal in this section is to obtain a practical tool to help study the memory requirements of two-player stochastic (or deterministic) games. This tool consists in reducing the study of the sufficiency of pure AIFM strategies for both players in two-player games to one-player games. We will first state our result, and the rest of the section is devoted to its proof. This result mentions two properties of classes of arenas called being closed by subarena and closed by split , which we will introduce later. In particular, it can be instantiated with A being the class of all initialized deterministic arenas or the class of all initialized stochastic arenas. Theorem 23 (Pure AIFM one-to-two-player lift).
Let ⊑ be a preference relation, M_1 and M_2 be two memory skeletons, and X ∈ { PFM , P , GFM , G } be a type of strategies. Let A be a class of initialized arenas that is closed by subarena, by split, and by product with M_1 and M_2 . Assume that
– in all initialized one-player arenas of P_1 in A , P_1 can play X -optimally with a pure strategy based on memory M_1 ;
– in all initialized one-player arenas of P_2 in A , P_2 can play X -optimally with a pure strategy based on memory M_2 .
Then all initialized two-player arenas in A admit a pure X -NE based on memory M_1 ⊗ M_2 . If A is moreover closed by prefix-extension, then all initialized two-player arenas in A admit a pure X -SPE based on memory M_1 ⊗ M_2 . The practical usage of this result can be summed up as follows: to determine whether pure AIFM strategies are sufficient for both players in stochastic (resp. deterministic) arenas to play X -optimally, it is sufficient to prove it for stochastic (resp. deterministic) one-player arenas. Studying memory requirements of one-player arenas is significantly easier than studying memory requirements of two-player arenas, as a one-player arena can be seen as a graph (in the deterministic case) or an MDP (in the stochastic case). Still, we will bring more tools to study memory requirements of one-player arenas in Section 5. Our proof technique for Theorem 23 is able to deal in a uniform manner with stochastic arenas and with deterministic arenas, under different types of strategies. It borrows ideas from [GZ09] and from [BLO +
20] and extends them both: it extends [GZ09] by generalizing to a wider type of strategies (AIFM instead of memoryless) and it extends [BLO +
20] by extending the class of arenas and preference relations considered (stochastic instead of deterministic). Thanks to Theorem 22, we also go further in our understanding of the optimal strategies: we are able to obtain the existence of X -SPE with almost the same constraints, instead of the weaker existence of X -NE. In order to prove Theorem 23, we first establish a similar result about memoryless strategies. We carry out all our intermediate proofs with the concept of Nash equilibrium, and we will strengthen it to subgame perfect equilibria at the end, thanks to Theorem 22.
Lemma 24 (Memoryless one-to-two-player lift).
Let ⊑ be a preference relation and X ∈ { PFM , P , GFM , G } be a type of strategies. Let A be a class of initialized arenas that is closed by subarena and by split. If both players have pure memoryless X -optimal strategies in the initialized one-player arenas in A , then all initialized arenas in A admit a pure memoryless X -NE. This result and its proof originate from [GZ09, Theorem 9], with the addition that we consider here initialized arenas, and that the strategies do not have to be optimal from all states, but only from the specified initial states. We explain why, albeit small, this is an important addition to obtain our result. Lemma 24 can immediately be instantiated with A being the class of all deterministic or stochastic arenas to obtain an interesting result about pure memoryless strategies. Our goal will be to instantiate it, for some fixed memory skeleton M , with the class of initialized arenas covered by M , so that we can later obtain results about pure AIFM strategies (through Lemma 21) instead of only using it for pure memoryless strategies, in a similar spirit to [BLO+20]; indeed, the class of initialized arenas covered by M happens to be closed by subarena and by split. This extension to pure AIFM strategies is one precise step where our notion of initialized arenas finds its use: in covered (or product) arenas, we are only interested in optimality from arena states associated to memory state m_init , and not from all states. Hence without restating and reproving [GZ09, Theorem 9] with an extra quantification on the initial states, it seems difficult to extend it in a straightforward way to a result about pure AIFM strategies. We recall the definitions of subarena and split from [GZ09], extending them in a natural way to initialized arenas.
Definition 25 (Initialized subarena).
Let ( A = ( S_1 , S_2 , A, δ, col ) , S init ) be an initialized arena. An initialized subarena of ( A , S init ) is an initialized arena (( S_1 , S_2 , A′ , δ′ , col′ ) , S init ) such that A′ ⊆ A , δ′ ⊆ δ (that is, some states might lose a few available actions), and col′ is the restriction of col to the pairs ( s, a ) such that s ∈ S and a ∈ A′ ( s ) . An initialized subarena keeps the same state space and initial states as the original arena, but with fewer available actions. Remember that we assume that arenas are non-blocking, hence at least one available action should be kept in each state of the initialized subarena. We say that a class A of initialized arenas is closed by subarena if for all ( A , S init ) ∈ A , if ( A′ , S init ) is a subarena of ( A , S init ), then ( A′ , S init ) ∈ A . We now define the notion of split on a state t of an arena: the main idea is that the state space of the arena is augmented in such a way that players remember what was the last action played when leaving state t . In practice, for each action available in t , we make a copy of the arena where only this action is available in t , and we then merge all these arenas on state t . Every action a available in t leads to a copy of the arena in which states are labeled by a . Before introducing the formal definition, we provide an example of a split in Figure 1. There are two actions a and b available in t , and we make a copy of the other states for each action. This way, when the game is for instance in s^a , we know that the last action that was chosen in t was a (which is not necessarily the case when in s in the original arena). The probabilities, colors and initial states are preserved in each copy of the initialized arena. Fig. 1.
Initialized arena with A(t) = {a, b} (omitting colors) (left) and its split on t (right). States controlled by P₁ (resp. P₂) are depicted by circles (resp. squares). The dot after playing action a represents a stochastic transition, with some probability of going to r and the complementary probability of going to s.

Definition 26 (Split).
Let (A = (S₁, S₂, A, δ, col), S_init) be an initialized arena, and t ∈ S₁. For a ∈ A(t), we denote by (Aᵃ, Sᵃ_init) the initialized subarena of (A, S_init) in which only action a is available in t and in which all states are renamed s ↦ sᵃ. The split on t of (A, S_init) is the initialized arena (Split(A, t), Sᵗ_init) where

Split(A, t) = ⊔_{a ∈ A(t)} (Aᵃ, tᵃ), with merged state called t, and Sᵗ_init = ∪_{a ∈ A(t)} {sᵃ | s ∈ S_init}

(with tᵃ = t for all a ∈ A(t)). This could be defined symmetrically for a state t ∈ S₂. The merge operator ⊔ was introduced at (3). When considering a split arena on a state t, we use the convention that tᵃ = t for any action a available in t. Moreover, for a an action available in t, and S′ ⊆ S, we write (S′)ᵃ for {sᵃ | s ∈ S′}. A class A of initialized arenas is closed by split if for all (A, S_init) ∈ A, for all states t of A, (Split(A, t), Sᵗ_init) ∈ A.

It is possible to formulate a few intuitive results linking plays and strategies of an initialized arena and its split. These results are provided in a very similar context in [GZ09]; we recall them precisely in Appendix A and sketch their statements here. Let (A = (S₁, S₂, A, δ, col), S_init) be an initialized arena with a state t ∈ S₁. For all a ∈ A(t), it is possible to build a natural bijection between strategies in Σᴳᵢ(A, S_init) and strategies on the split with restricted initial states Σᴳᵢ(Split(A, t), Sᵃ_init). Intuitively, the available actions are the same at every step and the split does not offer any more possibilities (besides having more initial states, which is why we must restrict the initial states to have a bijection). More memory might be needed to play the corresponding strategy in (A, S_init) than in its split (since the information of the last action played in t is not explicitly given in A), but finite-memory strategies stay finite-memory in both directions.
This bijection also preserves the "pure" feature of the strategy and preserves optimality and NE (Lemma 35). Now if a pair of strategies (σ₁, σ₂) is an X-NE in (Split(A, t), Sᵃ_init) and σ₁ is pure memoryless with σ₁(t) = a, only the part Sᵃ of the split is ever reached during the play, which corresponds to the state space of A. It is therefore possible to transform this X-NE into an X-NE (σ₁′, σ₂′) in (A, S_init) such that σ₁′ is still pure and memoryless (Lemma 36). If a pair of strategies (σ₁, σ₂) on (Split(A, t), Sᵃ_init) is such that σ₁ is pure memoryless with σ₁(t) = a, then for sᵃ ∈ Sᵃ_init, we can show that the distribution Pc^{σ₁,σ₂}_{Split(A,t),sᵃ} is equal to Pc^{σ₁ᵃ,σ₂ᵃ}_{Aᵃ,sᵃ}, where σ₁ᵃ and σ₂ᵃ are simply restrictions of σ₁ and σ₂ to histories of the subarena (Aᵃ, Sᵃ_init) (Lemma 37).

We are now ready to prove Lemma 24. The proof is by induction on the number of choices in arenas: for A = (S₁, S₂, A, δ, col) an arena, the number of choices in A is defined as

n_A = (Σ_{s ∈ S} |A(s)|) − |S|.   (6)

When the number of choices in A is 0, it means that there is exactly one available action in each state.

Proof (of Lemma 24).
We proceed by induction on the number of choices in arenas. If an initialized arena (A, S_init) ∈ A is such that n_A = 0, then both players only have a single available strategy, which is both pure and memoryless. Hence this pair of strategies corresponds to a pure memoryless X-NE, which proves the base case. Now let n >
0: we assume that the result holds for every initialized arena (A, S_init) ∈ A with n_A < n, and let (A, S_init) ∈ A be an initialized arena such that n_A = n. If P₁ has no choice (that is, |A(s)| = 1 for all s ∈ S₁), then (A, S_init) is an initialized one-player arena of P₂, and P₂ has a pure memoryless X-optimal strategy by hypothesis from the statement of the lemma (and as P₁ has only one possible strategy, which happens to be pure and memoryless, we have a pure memoryless X-NE). We now focus on the case where P₁ has at least one choice: let t ∈ S₁ be such that |A(t)| ≥
2. For a ∈ A(t), let (Aᵃ, Sᵃ_init) be the initialized subarena of (A, S_init) with only action a available in t, and all states renamed s ↦ sᵃ. This implies that (Aᵃ, Sᵃ_init) is in A, as A is closed by subarena. Notice that n_{Aᵃ} < n_A, as this is the same arena except that some available actions are removed in t. By induction hypothesis, there is thus a pure memoryless X-NE (σ₁ᵃ, σ₂ᵃ) in arena (Aᵃ, Sᵃ_init), for each a ∈ A(t).

We now consider the split on t of (A, S_init), which we denote (Split(A, t), Sᵗ_init), and which belongs to A as A is closed by split. Consider the pure memoryless strategy σ₂ = ∪_{a ∈ A(t)} σ₂ᵃ of P₂ defined on (Split(A, t), Sᵗ_init) which, when in Sᵃ for some a ∈ A(t), plays the same actions as the pure memoryless strategy σ₂ᵃ (this prescribes a unique action to every state of Split(A, t) controlled by P₂; the only overlap between the subarenas is t, but t belongs to P₁). We cannot straightaway define a similar strategy σ₁ for P₁, as it would not be well-defined in t; we first have to carefully choose the action played in t. To do so, we consider the initialized one-player arena (Split(A, t)_{σ₂}, Sᵗ_init) of P₁ resulting from fixing the (pure memoryless) strategy σ₂ of P₂ in Split(A, t). This initialized arena is an initialized subarena of (Split(A, t), Sᵗ_init) (actions of P₂ have been removed), and hence belongs to A. By hypothesis from the statement of the lemma, P₁ has a pure memoryless X-optimal strategy τ₁ in (Split(A, t)_{σ₂}, Sᵗ_init) since it is an initialized one-player arena. Let a* ∈ A(t) be the action τ₁(t) played by P₁ in t. We use this strategy τ₁ to define a pure memoryless strategy σ₁ of P₁ on (Split(A, t), Sᵗ_init) that plays a* in t, and for all a ∈ A(t), that behaves like σ₁ᵃ in Sᵃ; formally, σ₁(t) = τ₁(t) = a*, and for a ∈ A(t), s ∈ S \ {t}, σ₁(sᵃ) = σ₁ᵃ(sᵃ).
We now show that (σ₁, σ₂) is a pure memoryless X-NE in (Split(A, t), S^{a*}_init). We restrict our attention to initial states S^{a*}_init as these states are in the part of the arena that σ₁ always goes back to, and that will help us convert σ₁ into a corresponding pure memoryless strategy on (A, S_init) using Lemma 36. We show that for all s^{a*} ∈ S^{a*}_init, for all σ₁′ ∈ Σ₁^X(Split(A, t), S^{a*}_init), σ₂′ ∈ Σ₂^X(Split(A, t), S^{a*}_init),

Pc^{σ₁′,σ₂}_{Split(A,t),s^{a*}} ⊑ Pc^{σ₁,σ₂}_{Split(A,t),s^{a*}} ⊑ Pc^{σ₁,σ₂′}_{Split(A,t),s^{a*}}.

Let s^{a*} ∈ S^{a*}_init. The right-hand side inequality is clear, as the play always stays in A^{a*} by definition of σ₁, and the restriction of σ₁ and σ₂ to Hists(A^{a*}, S^{a*}_init) is (σ₁^{a*}, σ₂^{a*}), which is an X-NE in (A^{a*}, S^{a*}_init). For the left-hand side inequality, let σ₁′ ∈ Σ₁^X(Split(A, t), S^{a*}_init) be a strategy of P₁. We denote by τ₁^{a*} the restriction of τ₁ to Hists(A^{a*}, S^{a*}_init). We have

Pc^{σ₁′,σ₂}_{Split(A,t),s^{a*}} ⊑ Pc^{τ₁,σ₂}_{Split(A,t),s^{a*}}   as τ₁ is X-optimal against σ₂ in (Split(A, t), Sᵗ_init), and s^{a*} ∈ Sᵗ_init
= Pc^{τ₁^{a*},σ₂^{a*}}_{A^{a*},s^{a*}}   by Lemma 37
⊑ Pc^{σ₁^{a*},σ₂^{a*}}_{A^{a*},s^{a*}}   as (σ₁^{a*}, σ₂^{a*}) is an X-NE in A^{a*}
= Pc^{σ₁,σ₂}_{Split(A,t),s^{a*}}   by Lemma 37.

Hence (σ₁, σ₂) is a pure memoryless X-NE in (Split(A, t), S^{a*}_init). We can transform (σ₁, σ₂) into an X-NE (σ₁′, σ₂′) of (A, S_init) by Lemma 36, with σ₁′ pure memoryless, but not necessarily σ₂′. We can however perform the same proof for P₂, and obtain a second X-NE (σ₁″, σ₂″) such that σ₂″ is a pure memoryless strategy on (A, S_init). Then we simply mix both X-NE (by Lemma 16), and obtain that (σ₁′, σ₂″) is a pure memoryless X-NE in (A, S_init).
⊓⊔

In this section, we show how to apply Lemma 21 to Lemma 24 to lift its results from the sufficiency of pure memoryless strategies to the sufficiency of pure AIFM strategies, which will imply Theorem 23.
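Before lifting Lemma 24 to AIFM strategies, the two closure operations can be made concrete. Below is a Python sketch of the split of Definition 26, together with the extension of a covering witness to the split (the function φ̂ used in the proof of Lemma 27), under an assumed encoding of arenas as a dict mapping each state to a dict from available actions to lists of (probability, successor) pairs; all names are illustrative and not part of the paper's formalism.

```python
def split(arena, init, t):
    """Split on t (Definition 26): every state s != t is copied once per
    action a available in t, as a pair (s, a); t is kept as a single merged
    state, and playing a in t leads into the a-copy, so the state space
    records the last action chosen in t."""
    def cp(s, a):
        return t if s == t else (s, a)  # convention t^a = t for every a

    result = {}
    for a in arena[t]:
        for s, acts in arena.items():  # the a-copy of every state but t
            if s == t:
                continue
            result[(s, a)] = {
                act: [(p, cp(s2, a)) for (p, s2) in dist]
                for act, dist in acts.items()
            }
    result[t] = {  # in the merged state t, action a moves into the a-copy
        a: [(p, cp(s2, a)) for (p, s2) in dist]
        for a, dist in arena[t].items()
    }
    # S^t_init is the union over a of the a-copies of the initial states.
    split_init = {cp(s, a) for s in init for a in arena[t]}
    return result, split_init


def split_witness(phi, t):
    """Extend a covering witness phi (state -> memory state) to the split on
    t, as in the proof of Lemma 27: the merged state keeps phi(t), and each
    copy (s, a) inherits phi(s)."""
    return lambda state: phi(t) if state == t else phi(state[0])


# The shape of the arena of Figure 1, with probabilities chosen arbitrarily
# for the stochastic transition after action a (the exact values are not in
# the text).
arena = {
    't': {'a': [(0.5, 'r'), (0.5, 's')], 'b': [(1.0, 'r')]},
    'r': {'x': [(1.0, 't')]},
    's': {'y': [(1.0, 't')]},
}
split_arena, split_init = split(arena, {'r'}, 't')
phi_hat = split_witness({'t': 0, 'r': 1, 's': 0}.get, 't')
```

The split has |A(t)| copies of every state other than t plus the merged state t itself, matching the right-hand side of Figure 1.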
Lemma 27.
Let M be a memory skeleton, and A be a class of initialized arenas closed by subarena and by split. The class of all initialized arenas covered by M in A is closed by subarena and by split.

Proof. Let (A = (S₁, S₂, A, δ, col), S_init) ∈ A be an initialized arena covered by M = (M, m_init, α_upd). We show that its subarenas and its splits are still covered by M. Let φ: S → M be the witness that (A, S_init) is covered by M. If we consider an initialized subarena (A′, S_init) of (A, S_init), the same function φ will still be a witness that (A′, S_init) is covered by M, as the state space is the same, the condition to check is a universally quantified property over the transitions, and there are fewer transitions in A′ than in A. Let t ∈ S₁, and let (Split(A, t), Sᵗ_init) be the split on t of (A, S_init). We define a function φ̂ such that φ̂(t) = φ(t), and for a ∈ A(t), s ∈ S \ {t}, φ̂(sᵃ) = φ(s). Function φ̂ witnesses that (Split(A, t), Sᵗ_init) is covered by M, as every transition in Split(A, t) corresponds to a transition in A with the same color, linking two states assigned to the same memory states in A as in Split(A, t). ⊓⊔

We are now ready to prove Theorem 23, the main result of this section.
Proof (of Theorem 23).
Note first that as ((A, S_init) ⋉ M₁) ⋉ M₂ is isomorphic to (A, S_init) ⋉ (M₁ ⊗ M₂), A is in particular closed by product with M₁ ⊗ M₂. Using Lemma 21, the hypotheses can be reformulated as follows: for i ∈ {1, 2}, Pᵢ has a pure memoryless X-optimal strategy in all initialized one-player arenas in A that are covered by Mᵢ. Now consider the subclass A′ = {(A, S_init) ∈ A | (A, S_init) is covered by M₁ ⊗ M₂}. For i ∈ {1, 2}, Pᵢ has a pure memoryless X-optimal strategy in all its initialized one-player arenas in A′ (using that if an arena is covered by M₁ ⊗ M₂, it is in particular covered by Mᵢ by Lemma 20). Moreover, A′ is closed by subarena and by split by Lemma 27. Hence by Lemma 24, for all initialized arenas in A′, there exists a pure memoryless X-NE. Using Lemma 21 again in the other direction allows us to conclude that all initialized arenas in A admit a pure X-NE based on M₁ ⊗ M₂. By Theorem 22, using that A is closed by prefix-extension, the existence of pure X-NE based on M₁ ⊗ M₂ in all initialized arenas in A implies the existence of pure X-SPE based on M₁ ⊗ M₂ in all initialized arenas in A. ⊓⊔

Theorem 23 along with Lemmas 17 and 19 actually gives a bit more information about memory requirements in individual arenas than is strictly written. The way it is phrased shows that for an initialized arena (A, S_init), memoryless strategies always suffice to play X-optimally in (A, S_init) ⋉ (M₁ ⊗ M₂) (through Lemma 17). But if (A, S_init) is already covered by M₁ ⊗ M₂, then as (A, S_init) is isomorphic to (A, S_init) ⋉ (M₁ ⊗ M₂) by Lemma 19, memoryless strategies are actually sufficient directly in (A, S_init). Memory M₁ ⊗ M₂ is thus an upper bound on the required memory, and studying coverability may show for some initialized arenas that less memory is sufficient: for example, if an arena is already covered by M₁ ⊗ M₂ (resp. by one of M₁, M₂), memoryless strategies (resp. strategies based on the other skeleton) are sufficient.
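The memory bound M₁ ⊗ M₂ of Theorem 23 is the usual synchronized product of skeletons, which can be sketched as follows (a skeleton encoded as a triple of states, initial state, and update function; the encoding and all names are illustrative, not from the paper). Reading a word of colors with the resulting update function is also exactly how the languages L_{m₁,m₂} used in the next section are decided.

```python
def product(skel1, skel2):
    """Synchronized product M1 ⊗ M2: memory states are pairs, and both
    components are updated in parallel on every color read."""
    states1, init1, upd1 = skel1
    states2, init2, upd2 = skel2
    states = [(m1, m2) for m1 in states1 for m2 in states2]

    def upd(pair, color):
        m1, m2 = pair
        return (upd1(m1, color), upd2(m2, color))

    return states, (init1, init2), upd


def read(upd, m, word):
    """Read a word of colors from memory state m; a word w belongs to
    L_{m1,m2} iff read(upd, m1, w) == m2."""
    for color in word:
        m = upd(m, color)
    return m


# Two toy skeletons over the colors {'a', 'b'}: the parity of the number of
# a's seen so far, and whether a b has been seen.
skel1 = ([0, 1], 0, lambda m, c: 1 - m if c == 'a' else m)
skel2 = ([False, True], False, lambda m, c: m or c == 'b')
states, init, upd = product(skel1, skel2)
```

The product tracks both pieces of information at once, which is why an arena covered by one of the two skeletons only needs the other one as additional memory.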
An application of Theorem 23 is provided in Section 6.

AIFM characterization
For this section, we fix ⊑ a preference relation, X ∈ {PFM, P, GFM, G} a type of strategies, and M = (M, m_init, α_upd) a memory skeleton. We distinguish only two classes of initialized arenas: the class A^D_{P₁} of all initialized one-player deterministic arenas of P₁, and the class A^S_{P₁} of all initialized one-player stochastic arenas of P₁. A class of arenas will therefore be specified by a letter Y ∈ {D, S}, which we fix for the whole section. Our aim is to give a better understanding of the preference relations for which pure strategies based on M suffice to play X-optimally in A^Y_{P₁}, by characterizing it through two intuitive conditions. All definitions and proofs are stated from the point of view of P₁. We first introduce some more notations.

As we only work with one-player arenas in this section, we abusively write P^σ_{A,s} and Pc^σ_{A,s} for the distributions on plays and colors induced by a strategy σ of P₁ on (A, s), with no strategy for P₂. For A a one-player arena of P₁ and s a state of A, we write

[A]^X_s = {Pc^σ_{A,s} | σ ∈ Σ^X(A, s)}

for the set of distributions over (C^ω, F) induced by strategies of type X in A from s. For m₁, m₂ ∈ M, we write L_{m₁,m₂} = {w ∈ C* | α̂_upd(m₁, w) = m₂} for the language of words that are read from m₁ up to m₂ in M. Such a language can be specified by the deterministic automaton that is simply the memory skeleton M with m₁ as the initial state and m₂ as the unique final state. We extend the shifted distribution notation introduced in Definition 13 to sets of distributions: for w ∈ C*, for Λ ⊆ Dist(C^ω, F), we write wΛ for the set {wμ | μ ∈ Λ}. Given ⊑ a preference relation, we extend it to sets of distributions: for Λ₁, Λ₂ ⊆ Dist(C^ω, F), we write Λ₁ ⊑ Λ₂ if for all μ₁ ∈ Λ₁, there exists μ₂ ∈ Λ₂ such that μ₁ ⊑ μ₂; we write Λ₁ ⊏ Λ₂ if there exists μ₂ ∈ Λ₂ such that for all μ₁ ∈ Λ₁, μ₁ ⊏ μ₂.
Notice that ¬(Λ₁ ⊑ Λ₂) is equivalent to Λ₂ ⊏ Λ₁. If Λ₁ is a singleton {μ₁}, we write μ₁ ⊑ Λ₂ for {μ₁} ⊑ Λ₂ (and similarly for Λ₂, and similarly using ⊏). Notice that μ₁ ⊑ μ₂ is equivalent to {μ₁} ⊑ {μ₂}, so this notational shortcut is sound. For two initialized arenas (A₁, s₁) and (A₂, s₂), the inequality [A₁]^X_{s₁} ⊑ [A₂]^X_{s₂} means that for every strategy of type X on (A₁, s₁), there is a strategy of type X on (A₂, s₂) that induces a distribution that is at least as good.

We can now present the two properties of preference relations at the core of our characterization. These properties are called X-Y-M-monotony and X-Y-M-selectivity; they depend on a type of strategies X, a type of arenas Y, and a memory skeleton M. The first appearance of the monotony (resp. selectivity) notion was in [GZ05], which dealt with deterministic arenas under pure strategies and memoryless strategies; their monotony (resp. selectivity) is equivalent to our P-D-M_triv-monotony (resp. P-D-M_triv-selectivity). These notions were generalized in [BLO+20] to take into account a memory skeleton M in deterministic arenas; their notion of M-monotony (resp. M-selectivity) is equivalent to our P-D-M-monotony (resp. P-D-M-selectivity).

Definition 28 (Monotony).
We say that ⊑ is X-Y-M-monotone if for all m ∈ M, for all (A₁, s₁), (A₂, s₂) ∈ A^Y_{P₁}, there exists i ∈ {1, 2} such that

∀w ∈ L_{m_init,m}, w[A_{3−i}]^X_{s_{3−i}} ⊑ w[A_i]^X_{s_i}.

The crucial part of the definition is the order of the last two quantifiers: of course, given a w ∈ L_{m_init,m}, as ⊑ is total, it will always be the case that w[A₁]^X_{s₁} ⊑ w[A₂]^X_{s₂} or that w[A₂]^X_{s₂} ⊑ w[A₁]^X_{s₁}. However, we ask for something stronger: it must be the case that the set of distributions w[A_i]^X_{s_i} is preferred to w[A_{3−i}]^X_{s_{3−i}} for any word w ∈ L_{m_init,m}. The original monotony definition [GZ05] states that when presented with a choice once among two possible continuations, if a continuation is better than the other one after some prefix, then this continuation is also at least as good after all prefixes. This property alone does not imply the sufficiency of pure memoryless strategies, as it does not guarantee that if the same choice presents itself multiple times in the game, the same continuation should always be chosen: alternating between both continuations might still be beneficial in the long run; this is dealt with by selectivity. If memory skeleton M is necessary to play optimally, then it makes sense that there might be different optimal choices depending on the current memory state and that we should only compare prefixes that reach the same memory state. The point of taking into account a memory skeleton M in our definition of X-Y-M-monotony is to distinguish classes of prefixes and to only compare prefixes that are read up to the same memory state from m_init.

Definition 29 (Selectivity).
We say that ⊑ is X-Y-M-selective if for all m ∈ M, for all (A₁, s₁), (A₂, s₂) ∈ A^Y_{P₁} such that for i ∈ {1, 2}, ĉol(Hists(A_i, s_i, s_i)) ⊆ L_{m,m}, for all w ∈ L_{m_init,m},

w[(A₁, s₁) ⊔ (A₂, s₂)]^X_t ⊑ w[A₁]^X_{s₁} ∪ w[A₂]^X_{s₂},

where t denotes the state resulting from the merge of s₁ and s₂. Our formulation of the selectivity concept differs from the original definition [GZ05] and its AIFM counterpart [BLO +
20] in order to take into account the particularities of the stochastic context, even if it can be proven that they are equivalent in the pure deterministic case. However, the idea is still the same: the original selectivity definition states that when presented with a choice among multiple possible continuations after some prefix, if a continuation is better than the others, then as the game goes on, if the same choice presents itself again, it is sufficient to always pick the same continuation to play optimally; there is no need to alternate between continuations. This property alone does not imply the sufficiency of pure memoryless strategies, as it does not guarantee that for all prefixes, the same initial choice is always the one we should commit to; this is dealt with by monotony. The point of memory skeleton M in our definition is to guarantee that every time the choice presents itself, we are currently in the same memory state m.

In both definitions, the point of X is to distinguish whether we allow all (including randomized) strategies, or only pure strategies; the point of Y is to quantify over a specific set of arenas. An interesting property is that both notions are stable by product with a memory skeleton: if ⊑ is X-Y-M-monotone (resp. X-Y-M-selective), then for all memory skeletons M′, ⊑ is also X-Y-(M ⊗ M′)-monotone (resp. X-Y-(M ⊗ M′)-selective). The reason is that in each definition, we quantify universally over the class of all prefixes w that reach the same memory state m; if we consider classes that are subsets of the original classes, then the definition still holds.
This property matches the idea that playing with more memory is never detrimental.

Combined, it is intuitively reasonable that X-Y-M-monotony and X-Y-M-selectivity are equivalent to the sufficiency of pure strategies based on M to play X-optimally in A^Y_{P₁}: monotony tells us that when a single choice has to be made given a state of the arena and a memory state, the best choice is always the same no matter what prefix has been seen, and selectivity tells us that once a good choice has been made, we can commit to it in the future of the game. We formalize and prove this idea in Theorem 32. First, we add an extra restriction on preference relations which is useful when stochasticity is involved.

Definition 30 (Mixing is useless).
We say that mixing is useless for ⊑ if for all at most countable sets I, for all non-negative reals (λᵢ)_{i∈I} such that Σ_{i∈I} λᵢ = 1, for all families (μᵢ)_{i∈I}, (μᵢ′)_{i∈I} of distributions in Dist(C^ω, F), if for all i ∈ I, μᵢ ⊑ μᵢ′, then Σ_{i∈I} λᵢμᵢ ⊑ Σ_{i∈I} λᵢμᵢ′. That is, if we can write a distribution as a convex combination of distributions, then it is never detrimental to improve a distribution appearing in the convex combination.
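Remark 31 below gives a counterexample among the relations of Example 8; a quick numeric check of it, representing each distribution only by its probability of reaching the color c (which is all that the relation inspects), and with one assumed formalization of the relation (a distribution is good iff that probability is exactly 1/2, and all good distributions are strictly above all others):

```python
def at_least_as_good(p, q):
    """p is at least as good as q iff p hits 1/2 exactly, or q misses it too.
    (Assumed formalization of 'probability to reach c that is precisely 1/2'.)"""
    return p == 0.5 or q != 0.5

# mu1 is strictly worse than mu1', and mu2 is as good as itself.
mu1, mu1p, mu2 = 0.0, 0.5, 1.0

# Improving the first component of the convex combination (1/2, 1/2) of
# reachability probabilities makes the mixture strictly worse:
mix_before = 0.5 * mu1 + 0.5 * mu2    # = 0.5, exactly on target
mix_after = 0.5 * mu1p + 0.5 * mu2    # = 0.75, off target
```

So improving a component of a convex combination can degrade the combination: mixing is not useless for this relation, and the restriction of Definition 30 is genuinely needed.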
Remark 31.
All preference relations encoded as real payoff functions (as defined in Example 8) satisfy this property, thanks to properties of the expected value. The third preference relation from Example 8 (having a probability to reach c ∈ C that is precisely 1/2) does not satisfy this property: if μ₁(♦c) = 0, μ₁′(♦c) = 1/2, and μ₂(♦c) = 1, we have μ₁ ⊏ μ₁′ and μ₂ ⊑ μ₂, but (1/2)μ₁′ + (1/2)μ₂ ⊏ (1/2)μ₁ + (1/2)μ₂. In case we consider pure strategies and deterministic games, only Dirac distributions on infinite words occur as probability distributions induced by an arena and a strategy, so the requirement that mixing is useless is not needed. ⊳

Theorem 32. Assume that X ∈ {P, PFM} and Y = D, or that mixing is useless for ⊑. Then pure strategies based on M suffice to play X-optimally in all initialized one-player arenas in A^Y_{P₁} for P₁ if and only if ⊑ is X-Y-M-monotone and X-Y-M-selective.

We start with the proof of the necessary condition of Theorem 32, which is the easiest direction. The main idea is to build the right arenas (using the arenas occurring in the definitions of monotony and selectivity) so that we can use the hypothesis about the existence of pure X-optimal strategies based on M to immediately deduce X-Y-M-monotony and X-Y-M-selectivity. It is not necessary that mixing is useless for ⊑ for this direction of the equivalence.

Fig. 2.
Initialized arenas (A_mon, {s_w, s_{w′}}) (left) and (A_sel, s_w) (right).

Proof (of the necessary condition of Theorem 32).
We assume that pure strategies based on M suffice to play X-optimally in A^Y_{P₁} for P₁.

We first prove that ⊑ is X-Y-M-monotone. Let m ∈ M and (A₁, s₁), (A₂, s₂) ∈ A^Y_{P₁} be initialized one-player arenas of P₁. If for all w ∈ L_{m_init,m}, both w[A₁]^X_{s₁} ⊑ w[A₂]^X_{s₂} and w[A₂]^X_{s₂} ⊑ w[A₁]^X_{s₁}, that is, if both sets of distributions are just as good as each other, then we can take either i = 1 or i = 2 and the definition is satisfied. If that is not the case, this means that there exists w′ ∈ L_{m_init,m} such that, w.l.o.g.,

w′[A₁]^X_{s₁} ⊏ w′[A₂]^X_{s₂}.   (7)

We take i = 2. It is left to show that for all w ∈ L_{m_init,m}, w[A₁]^X_{s₁} ⊑ w[A₂]^X_{s₂}. Let w ∈ L_{m_init,m}. For j ∈ {1, 2}, we assume w.l.o.g. that state s_j has no incoming transition in A_j, and therefore cannot be reached after being left. If it is not the case, we can create a new one-player arena A′_j by adding a new state s′_j mimicking the outgoing transitions of s_j, but without any incoming transition, and we have [A_j]^X_{s_j} = [A′_j]^X_{s′_j}. We also assume w.l.o.g. that the state and action spaces of A₁ and A₂ are disjoint. We consider the arena A_mon obtained from (A₁, s₁) ⊔ (A₂, s₂), where t is the state resulting from the merge of s₁ and s₂, by prepending two chains reading respectively w and w′ up to t. We consider two initial states s_w and s_{w′}, which are the states at the start of the "chains" corresponding respectively to w and w′. Arena A_mon is depicted on Figure 2; A_mon consists of two chains reading w and w′ up to state t, and then a choice between going to A₁ or A₂, with no possibility of ever going back to t. The initialized arena (A_mon, {s_w, s_{w′}}) is in A^Y_{P₁}, as all operations used preserve the number of players and the deterministic/stochastic feature. By hypothesis, P₁ has a pure X-optimal strategy σ ∈ Σ^{PFM}(A_mon, {s_w, s_{w′}}) encoded as a Mealy machine (M, α_nxt). Remember that both w and w′ reach state m of the memory skeleton M when read from m_init.
Therefore, no matter whether the play starts in s_w or s_{w′}, the action played in t by strategy σ is given by α_nxt(t, m) (which cannot be a randomized choice, as σ is pure). Since σ is X-optimal in (A_mon, s_{w′}), by (7), this action must necessarily be an action of A₂. Now since σ is also X-optimal in (A_mon, s_w), this means that going to A₂ after w is at least as good as going to A₁. In other words, we have w[A₁]^X_{s₁} ⊑ w[A₂]^X_{s₂}, which ends the X-Y-M-monotony proof.

We now prove that ⊑ is X-Y-M-selective. Let m ∈ M and (A₁, s₁), (A₂, s₂) ∈ A^Y_{P₁} such that for i ∈ {1, 2}, ĉol(Hists(A_i, s_i, s_i)) ⊆ L_{m,m}. Let w ∈ L_{m_init,m}. We consider the arena A_sel obtained from (A₁, s₁) ⊔ (A₂, s₂), where t is the state resulting from the merge of s₁ and s₂, by prepending a chain reading w up to t. We consider an initial state s_w, which is the state at the start of the "chain" corresponding to w. Arena A_sel is depicted on Figure 2; A_sel consists of one chain reading w up to state t, and then has the ability to go either to A₁ or A₂. Here, it is possible to visit t multiple times (as long as it was possible to go back to s₁ in (A₁, s₁) or to s₂ in (A₂, s₂)). The initialized arena (A_sel, s_w) is in A^Y_{P₁}, as all operations used preserve the number of players and the deterministic/stochastic feature. By hypothesis, P₁ has a pure X-optimal strategy σ ∈ Σ^{PFM}(A_sel, s_w) encoded as a Mealy machine (M, α_nxt). By X-optimality of σ, we have that

w[(A₁, s₁) ⊔ (A₂, s₂)]^X_t ⊑ Pc^σ_{A_sel,s_w}.   (8)

Since w is in L_{m_init,m} and, for i ∈ {1, 2}, ĉol(Hists(A_i, s_i, s_i)) is a subset of L_{m,m}, we have that ĉol(Hists(A_sel, s_w, t)) is a subset of L_{m_init,m}. Therefore, at each passage in t, strategy σ plays the action given by α_nxt(t, m) (which cannot be a randomized choice, as σ is pure). Strategy σ thus commits to A₁ or A₂ forever, which means that Pc^σ_{A_sel,s_w} ∈ w[A₁]^X_{s₁} ∪ w[A₂]^X_{s₂}.
By combining this last fact with (8), we obtain that w[(A₁, s₁) ⊔ (A₂, s₂)]^X_t ⊑ w[A₁]^X_{s₁} ∪ w[A₂]^X_{s₂}, which ends the X-Y-M-selectivity proof. ⊓⊔

We sketch the proof of the sufficient condition of Theorem 32. We first reduce the problem to the existence of pure memoryless strategies in initialized arenas covered by M, using Lemma 21. We proceed with an induction on the number of choices in these arenas (as for Theorem 23). The base case is again trivial (as in an arena in which all states have a single available action, there is a single strategy, which is pure and memoryless). For the induction step, we take an initialized arena (A′, S_init) ∈ A^Y_{P₁} covered by M with at least one choice, and we pick a state t with (at least) two available actions. A memory state φ(t) is associated to t thanks to the coverability property. We consider the subarenas (A′_a, S_init) with a single action a available in t, to which we can apply the induction hypothesis and obtain a pure memoryless X-optimal strategy σ_a in each subarena. It is left to prove that one of these strategies is also X-optimal in (A′, S_init); this is where X-Y-M-monotony and X-Y-M-selectivity come into play. The property of X-Y-M-monotony tells us that one of these subarenas (A′_{a*}, S_init) is preferred to the others w.r.t. ⊑ after reading any word in L_{m_init,φ(t)}. We now want to use X-Y-M-selectivity to conclude that there is no reason to use actions different from a* when coming back to t, and that σ_{a*} is therefore also X-optimal in (A′, S_init). To do so, we take any strategy σ ∈ Σ^X(A′, s) for s ∈ S_init and we condition distribution P^σ_{A′,s} over all the ways it reaches (or not) t, which gives a convex combination of probability distributions. We want to state that once t is reached, no matter how, switching to strategy σ_{a*} is always beneficial.
For this, we would like to use X-subgame-perfection of σ_{a*} rather than simply X-optimality: this is why, in the actual proof, our induction hypothesis is about X-SP strategies and not X-optimal strategies. Luckily, Theorem 22 indicates that requiring subgame perfection is not really stronger than what we want to prove. We then use that mixing is useless for ⊑ to replace all the parts that go through t in the convex combination by a better distribution induced by σ_{a*} from t.

We need two (intuitive) technical lemmas, whose proofs can be found in Appendix B. We first define a notion similar to shifted distributions (Definition 13) for distributions on plays: for (A, s) an initialized one-player arena, for ρ ∈ Hists(A, s), if μ ∈ Dist(Plays(A, out(ρ)), F_{(A,out(ρ))}) is a distribution on plays, then for E ∈ F_{(A,s)} an event, we define

ρμ(E) = μ({π ∈ Plays(A, out(ρ)) | ρπ ∈ E}).

We have used here an abuse of notation: if ρ = s₀a₁s₁ ... aₙsₙ, for π = sₙaₙ₊₁sₙ₊₁ ... ∈ Plays(A, out(ρ)), we write ρπ for the play s₀a₁s₁ ... aₙsₙaₙ₊₁sₙ₊₁ ..., with no repetition of sₙ.

Lemma 33.
Let (A = (S₁, S₂, A, δ, col), s) be an initialized one-player arena and ρ ∈ Hists(A, s). Let μ be a distribution on plays in Dist(Plays(A, out(ρ)), F_{(A,out(ρ))}). We have ĉol(ρμ) = ĉol(ρ)ĉol(μ).

Lemma 34.
Let (A, s) ∈ A^Y_{P₁} be an initialized one-player arena and σ, τ ∈ Σ^G(A, s) be two strategies. Let ρ = s₀a₁s₁ ... aₙsₙ ∈ Hists(A, s). We say that σ coincides with τ on ρ if for each prefix ρᵢ = s₀a₁s₁ ... aᵢsᵢ of ρ with 0 ≤ i < n, σ(ρᵢ) = τ(ρᵢ). If σ coincides with τ on ρ, then P^σ_{A,s}[Cyl(ρ)] = P^τ_{A,s}[Cyl(ρ)]. Let t be a state of A. We write ¬♦t for the event in F_{(A,s)} that consists of all the infinite plays that never visit t. Assume that for all ρ = s₀a₁s₁ ... aₙsₙ ∈ Hists(A, s) such that for all i, 0 ≤ i < n, sᵢ ≠ t, σ coincides with τ on ρ. Then P^σ_{A,s}[¬♦t] = P^τ_{A,s}[¬♦t] and, if P^σ_{A,s}[¬♦t] > 0, P^σ_{A,s}[· | ¬♦t] = P^τ_{A,s}[· | ¬♦t].

We now have all the ingredients for the proof of the missing implication of Theorem 32.
Proof (of the sufficient condition of Theorem 32).
We assume now that mixing is useless for ⊑, and that ⊑ is X-Y-M-monotone and X-Y-M-selective. We prove that pure strategies based on M suffice to play X-optimally in A^Y_{P₁} for P₁. Equivalently, thanks to Lemma 21, we show that for all initialized arenas covered by M in A^Y_{P₁}, P₁ has a pure memoryless X-optimal strategy. We will actually prove something stronger, which is that for all initialized one-player arenas covered by M in A^Y_{P₁}, P₁ has a pure memoryless X-SP strategy.

Let (A, S_init) ∈ A^Y_{P₁} be an initialized one-player arena covered by M. Our proof proceeds by induction on the number of choices n_{A′} of subarenas (A′, S_init) of (A, S_init). Our induction will prove the following property for subarenas (A′, S_init) of (A, S_init): there exists a pure memoryless strategy σ ∈ Σ^{PFM}(A′, S_init) such that for all ρ ∈ Hists(A, S_init), σ is X-optimal in the game (A′, out(ρ), ⊑_{[ĉol(ρ)]}). We call this property having a pure memoryless (A, S_init)-X-SP strategy. There is a slight abuse of notation in the definition: σ is not necessarily well-defined from out(ρ), but as it is pure memoryless, we simply interpret it as a function S → A, and for ρ′ ∈ Hists(A′, out(ρ)), we define σ(ρ′) = σ(out(ρ′)). For subarenas (A′, S_init), having a pure memoryless (A, S_init)-X-SP strategy is stronger than having a pure memoryless X-SP strategy, as Hists(A′, S_init) is a subset of Hists(A, S_init). For arena (A, S_init), having a pure memoryless (A, S_init)-X-SP strategy is equivalent to having a pure memoryless X-SP strategy, which is what we want to prove. Requiring SP strategies instead of simply optimal strategies may seem stronger than what we actually need, but by Theorem 22, it turns out to be equivalent in this AIFM context; we use SP strategies in this case for technical reasons.

Let (A′ = (S₁, S₂, A′, δ′, col′), S_init) be a subarena of (A, S_init).
If n_{A′} = 0, then P₁ has only one strategy, which is in particular a pure memoryless (A, S_init)-X-SP strategy (notation n_{A′} is defined at (6)). Now let n >
0; we assume that the property is true for all arenas (A′, S_init) such that n_{A′} < n, and we take (A′, S_init) such that n_{A′} = n. Since n >
0, there is a state t ∈ S₁ such that |A′(t)| ≥ 2. For a ∈ A′(t), let (A′_a, S_init) ∈ A^Y_{P₁} be the initialized subarena of (A′, S_init) such that only action a is available in t. Initialized arena (A′_a, S_init) is covered by M (Lemma 27). By induction hypothesis, for all a ∈ A′(t), P₁ has a pure memoryless (A, S_init)-X-SP strategy σ_a in (A′_a, S_init).

Let m ∈ M be the memory state corresponding to t in (A, S_init), that is, if φ is the function witnessing that (A, S_init) is covered, m = φ(t). The same function φ also witnesses that all the initialized subarenas of (A, S_init) are covered by M. As ⊑ is X-Y-M-monotone, there exists a* ∈ A′(t) such that for all w ∈ L_{m_init,m}, for all a ∈ A′(t),

w[A′_a]^X_t ⊑ w[A′_{a*}]^X_t.   (9)

Notice that as (A, S_init) is covered by M, ĉol(Hists(A, S_init, t)) ⊆ L_{m_init,m}. We now prove that the pure memoryless strategy σ_{a*} is (A, S_init)-X-SP in (A′, S_init). Let ρ ∈ Hists(A, S_init). We denote w = ĉol(ρ) and s = out(ρ). Let σ be any strategy in Σ^X(A′, s). Our goal is to show that σ_{a*} is at least as good as σ in (A′, s, ⊑_{[w]}), i.e., that w Pc^σ_{A′,s} ⊑ w Pc^{σ_{a*}}_{A′,s}. We condition P^σ_{A′,s} over whether t is visited or not (we assume that t is both visited and not visited with a non-zero probability; otherwise, one of the terms of the following sum is simply 0). We denote by ♦t the event of visiting state t and by Hists(A′, s, t!) the set of histories in Hists(A′, s, t) that visit t exactly once (at their last step). We have

P^σ_{A′,s} = P^σ_{A′,s}[¬♦t] · P^σ_{A′,s}[· | ¬♦t] + P^σ_{A′,s}[♦t] · P^σ_{A′,s}[· | ♦t]
= P^σ_{A′,s}[¬♦t] · P^σ_{A′,s}[· | ¬♦t] + Σ_{ρ′ ∈ Hists(A′,s,t!)} P^σ_{A′,s}[Cyl(ρ′)] · P^σ_{A′,s}[· | Cyl(ρ′)]
= P^σ_{A′,s}[¬♦t] · P^σ_{A′,s}[· | ¬♦t] + Σ_{ρ′ ∈ Hists(A′,s,t!)} P^σ_{A′,s}[Cyl(ρ′)] · ρ′P^{σ[ρ′]}_{A′,t}.
By applying operator c col to P σ A ′ ,s and shifting the distribution with w , thanks to Lemma 33 and the previous equation, we have
w Pc σ A ′ ,s = P σ A ′ ,s [ ¬ ♦ t ] · w c col ( P σ A ′ ,s [ · | ¬ ♦ t ]) + Σ ρ ′ ∈ Hists ( A ′ ,s,t !) s.t. c col ( ρ ′ ) = w ′ P σ A ′ ,s [ Cyl ( ρ ′ )] · ww ′ Pc σ [ ρ ′ ] A ′ ,t . (10)
For ρ ′ ∈ Hists ( A ′ , s, t !), w ′ = c col ( ρ ′ ), let us focus on the distribution ww ′ Pc σ [ ρ ′ ] A ′ ,t . Notice that distribution Pc σ [ ρ ′ ] A ′ ,t can also be induced by some strategy in Σ X ( Split ( A ′ , t ) , t ) by Lemma 35. Therefore, ww ′ Pc σ [ ρ ′ ] A ′ ,t ∈ ww ′ [ Split ( A ′ , t )] X t = ww ′ [ ⊔ a ∈ A ′ ( t ) ( A ′ a , t )] X t . Using the hypotheses, we get
ww ′ Pc σ [ ρ ′ ] A ′ ,t ∈ ww ′ [ ⊔ a ∈ A ′ ( t ) ( A ′ a , t )] X t ⊑ ∪ a ∈ A ′ ( t ) ww ′ [ A ′ a ] X t (by X - Y - M -selectivity)
⊑ ww ′ [ A ′ a ∗ ] X t (by (9), which relied on X - Y - M -monotony)
⊑ ww ′ Pc σ a ∗ A ′ a ∗ ,t (as σ a ∗ is pure memoryless ( A , S init )- X -SP in ( A ′ a ∗ , S init )).
Therefore, by using this last equation in (10), thanks to the fact that mixing is useless for ⊑ (or, if we consider pure strategies and deterministic arenas, that the sum contains a single term corresponding to an infinite word), we obtain
w Pc σ A ′ ,s ⊑ P σ A ′ ,s [ ¬ ♦ t ] · w c col ( P σ A ′ ,s [ · | ¬ ♦ t ]) + Σ ρ ′ ∈ Hists ( A ′ ,s,t !) with c col ( ρ ′ )= w ′ P σ A ′ ,s [ Cyl ( ρ ′ )] · ww ′ Pc σ a ∗ A ′ a ∗ ,t . (11)
We show that the right-hand side of this inequality can be written as a distribution w Pc τ A ′ a ∗ ,s , for a suitably chosen strategy τ ∈ Σ X ( A ′ a ∗ , s ).
Let τ ∈ Σ X ( A ′ a ∗ , s ) be such that τ starts playing like σ and then switches to σ a ∗ as soon as t is visited; formally, for ρ ′′ ∈ Hists ( A ′ a ∗ , s ), τ ( ρ ′′ ) = σ ( ρ ′′ ) if ρ ′′ does not visit t , and τ ( ρ ′′ ) = σ a ∗ ( out ( ρ ′′ )) if ρ ′′ visits t . Strategy τ only plays action a ∗ in t , and is therefore a strategy on ( A ′ a ∗ , s ).
As τ coincides with σ as long as t has not been visited, using Lemma 34, we have
P σ A ′ ,s [ ¬ ♦ t ] = P τ A ′ a ∗ ,s [ ¬ ♦ t ] , P σ A ′ ,s [ · | ¬ ♦ t ] = P τ A ′ a ∗ ,s [ · | ¬ ♦ t ] , P σ A ′ ,s [ Cyl ( ρ ′ )] = P τ A ′ a ∗ ,s [ Cyl ( ρ ′ )] .
Moreover, for all ρ ′ ∈ Hists ( A ′ , s, t !), Pc σ a ∗ A ′ a ∗ ,t = Pc τ [ ρ ′ ] A ′ a ∗ ,t as t is immediately visited. We can therefore replace all terms of the right-hand side of (11) and obtain, using Lemma 33,
w Pc σ A ′ ,s ⊑ P τ A ′ a ∗ ,s [ ¬ ♦ t ] · w c col ( P τ A ′ a ∗ ,s [ · | ¬ ♦ t ]) + Σ ρ ′ ∈ Hists ( A ′ ,s,t !) with c col ( ρ ′ )= w ′ P τ A ′ a ∗ ,s [ Cyl ( ρ ′ )] · ww ′ Pc τ [ ρ ′ ] A ′ a ∗ ,t
⊑ w c col ( P τ A ′ a ∗ ,s [ ¬ ♦ t ] · P τ A ′ a ∗ ,s [ · | ¬ ♦ t ] + Σ ρ ′ ∈ Hists ( A ′ ,s,t !) P τ A ′ a ∗ ,s [ Cyl ( ρ ′ )] · ρ ′ P τ [ ρ ′ ] A ′ a ∗ ,t )
= w c col ( P τ A ′ a ∗ ,s [ ¬ ♦ t ] · P τ A ′ a ∗ ,s [ · | ¬ ♦ t ] + P τ A ′ a ∗ ,s [ ♦ t ] · P τ A ′ a ∗ ,s [ · | ♦ t ])
= w c col ( P τ A ′ a ∗ ,s )
= w Pc τ A ′ a ∗ ,s .
Now since w is the sequence of colors corresponding to ρ ∈ Hists ( A , S init ) and σ a ∗ is ( A , S init )- X -SP in ( A ′ a ∗ , S init ), we have w Pc τ A ′ a ∗ ,s ⊑ w Pc σ a ∗ A ′ a ∗ ,s , which ends the proof. ⊓⊔
We provide an application of Theorem 32 in Section 6, proving that a preference relation admits pure AIFM optimal strategies in its one-player games. The literature provides some sufficient conditions for preference relations to admit pure memoryless optimal strategies in one-player stochastic games (for instance, in [Gim07]). Here, we obtain a full characterization when mixing is useless for ⊑ (in particular, this is a full characterization for real payoff functions), which can deal not only with memoryless strategies, but also with the more general AIFM strategies. We do not believe that our characterization is necessarily easy to use in practice; the sufficient conditions of [Gim07], when they apply, are arguably easier to verify.
However, there are examples in which these sufficient conditions are not verified even though pure memoryless strategies suffice (one such example is provided in [BBE10]), and that is where our characterization can help.
It is interesting to relate the concepts of monotony and selectivity to other properties from the literature to simplify the use of our characterization. For instance, if a payoff function f : C ω → R is prefix-independent, then it is also X - Y - M -monotone for any X , Y , and M ; therefore, the problem immediately reduces to analyzing selectivity in order to study the sufficiency of pure AIFM strategies with f . We use this reasoning in the stochastic case of the application of Section 6.

6 Application
Let C = N . We illustrate the use of our two main theorems (Theorems 32 and 23) to study the memory requirements of the weak parity [Tho08] winning condition
W wp = { c 0 c 1 . . . ∈ C ω | max j ≥ 0 c j exists and is even } ,
which was introduced in Example 8, both in deterministic and in stochastic games. In this example, we abusively use W wp for ⊑ W wp . We say that a word w ∈ C ω is winning if w ∈ W wp , and losing if w ∉ W wp . As this preference relation can be encoded as a payoff function (namely, the indicator function of W wp ), we have that mixing is useless for W wp .

Deterministic games.
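As a warm-up for the deterministic analysis below, note that membership in W wp is easy to decide on ultimately periodic words u v ω — precisely the outcomes produced by pure finite-memory strategies in finite deterministic arenas. A small sketch (the prefix/cycle list representation is our own convention, not the paper's):

```python
def weak_parity_wins(prefix, cycle):
    """Decide membership of the ultimately periodic word prefix . cycle^omega
    in W_wp: its maximal color (which always exists for such words) is even."""
    assert cycle, "the cycle must be non-empty to define an infinite word"
    return max(list(prefix) + list(cycle)) % 2 == 0

# The maximal color of [1, 2] . [0]^omega is 2 (even): winning.
# The maximal color of [3] . [1, 2]^omega is 3 (odd): losing.
```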
We first focus on deterministic games with pure strategies: we show that pure memoryless strategies are sufficient. To do so, we first consider one-player games — notice that reasoning about one-player games of P 1 and of P 2 is very similar, as the objective of P 2 can be rephrased as the objective of P 1 just by replacing all colors c by c + 1. We thus only show arguments from the point of view of P 1 . We prove that the class A D P 1 of all initialized one-player deterministic arenas of P 1 admits pure memoryless P -optimal strategies (i.e., pure P -optimal strategies based on the trivial memory skeleton M triv with a single state) by proving that W wp is P - D - M triv -monotone and P - D - M triv -selective.
We start with P - D - M triv -monotony. Let ( A 1 , s 1 ) , ( A 2 , s 2 ) ∈ A D P 1 . Notice that as we are restricted to pure strategies in deterministic arenas, notation [ A i ] P s i refers to a set of (Dirac distributions on) infinite words. For i ∈ { 1 , 2 } , let
e i = max { max j ≥ 0 c j | c 0 c 1 . . . ∈ [ A i ] P s i ∩ W wp }
be the greatest even color reachable in ( A i , s i ) without reaching any greater color (or −∞ if it is not possible to have an even maximal color).
We first deal with the case e 1 ≠ −∞ or e 2 ≠ −∞ . Assume w.l.o.g. that e 1 ≤ e 2 . Let σ ∈ Σ P ( A 2 , s 2 ) be a pure strategy achieving a maximal color exactly e 2 . We prove that for any word w ∈ C ∗ (that is, for any word w ∈ L m init ,m init , as the memory skeleton has only one state), we have
w [ A 1 ] P s 1 ⊑ w Pc σ A 2 ,s 2 . (12)
Let w = c 0 c 1 . . . c n ∈ C ∗ and n w = max 0 ≤ j ≤ n c j . If n w is even or n w ≤ e 2 , then w Pc σ A 2 ,s 2 is a winning word and (12) holds. If not, it means that n w is odd and n w > e 2 ≥ e 1 , in which case all words in w [ A 1 ] P s 1 are necessarily losing, and (12) also holds.
We now deal with the case e 1 = e 2 = −∞ , in which there is no way to obtain an even maximal color, neither in A 1 nor in A 2 . For i ∈ { 1 , 2 } , let o i = min { max j ≥ 0 c j | c 0 c 1 . . .
∈ [ A i ] P s i } be the minimal greatest color appearing along a play (which is necessarily odd, as an even greatest color is not possible). Assume w.l.o.g. that o 1 ≥ o 2 , and let σ ∈ Σ P ( A 2 , s 2 ) be a pure strategy achieving a maximal color exactly o 2 . We show again that for all w ∈ C ∗ ,
w [ A 1 ] P s 1 ⊑ w Pc σ A 2 ,s 2 . (13)
Let w = c 0 c 1 . . . c n ∈ C ∗ and n w = max 0 ≤ j ≤ n c j . If all words in w [ A 1 ] P s 1 are losing then (13) is true. If there is a winning word in w [ A 1 ] P s 1 , it means that n w is even and n w > o 1 ≥ o 2 . Hence, w Pc σ A 2 ,s 2 is also winning and (13) also holds.
In both cases, we have w [ A 1 ] P s 1 ⊑ w Pc σ A 2 ,s 2 ⊑ w [ A 2 ] P s 2 . This proves P - D - M triv -monotony.
We now turn to P - D - M triv -selectivity. Let ( A 1 , s 1 ) , ( A 2 , s 2 ) ∈ A D P 1 . Note that the requirement that c col ( Hists ( A i , s i , s i )) ⊆ L m init ,m init does not bring information as with this particular memory skeleton, all words are in L m init ,m init . Let w = c 0 c 1 . . . c n ∈ C ∗ and n w = max 0 ≤ j ≤ n c j . We prove that
w [( A 1 , s 1 ) ⊔ ( A 2 , s 2 )] P t ⊑ w [ A 1 ] P s 1 ∪ w [ A 2 ] P s 2 . (14)
If all words of w [( A 1 , s 1 ) ⊔ ( A 2 , s 2 )] P t are losing, then (14) is true. We now assume that there is a winning word ww ′ ∈ w [( A 1 , s 1 ) ⊔ ( A 2 , s 2 )] P t ; we show that we can also find a winning word in w [ A 1 ] P s 1 ∪ w [ A 2 ] P s 2 .
Assume word ww ′ sees its maximal color n in w ′ . If the play corresponding to w ′ in ( A 1 , s 1 ) ⊔ ( A 2 , s 2 ) comes back to t after seeing n for the first time, then n is the greatest color on some cycle on t . This means that the strategy repeatedly playing this cycle only takes actions in A 1 or in A 2 and also wins. If the play corresponding to w ′ does not come back to t after seeing n for the first time, then there is a suffix of the play fully in A 1 or in A 2 — this suffix can be played after the first visit to t , and this generates a winning play.
Now, assume ww ′ sees its maximal color n in w .
If the play corresponding to w ′ comes back to t , the strategy repeatedly playing this cycle on t is winning, as no color greater than n is seen in this cycle. If there is no cycle on t , it means that w ′ is already an infinite word in [ A 1 ] P s 1 ∪ [ A 2 ] P s 2 . We have therefore shown (14) in every case; this shows P - D - M triv -selectivity.
We have proven that W wp is P - D - M triv -monotone and P - D - M triv -selective; by Theorem 32, this implies that pure memoryless strategies are sufficient to play P -optimally in one-player deterministic arenas of P 1 . The same arguments hold from the point of view of P 2 . As we have shown that both players' one-player arenas admit pure memoryless P -optimal strategies, by Theorem 23, we conclude that both players have pure memoryless P -optimal (even P -SP) strategies in all two-player deterministic arenas.

Stochastic games.
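Before moving on, the memoryless determinacy just established can also be seen algorithmically: weak parity games are solved by repeatedly computing the attractor to the current maximal color for the player of its parity, and the induced winning strategies are memoryless. The following self-contained sketch is our own naive implementation, not taken from the paper; it assumes every state keeps at least one successor in each subgame considered.

```python
def attractor(V, owner, succ, target, player):
    """States of V from which `player` (0 or 1) can force a visit to `target`
    (standard backward fixpoint; assumes target is a subset of V)."""
    attr = set(target)
    pending = {v: len(succ[v]) for v in V}   # opponent successors not yet attracted
    frontier = list(target)
    while frontier:
        u = frontier.pop()
        for v in V:                          # naive predecessor scan
            if v in attr or u not in succ[v]:
                continue
            if owner[v] == player:
                attr.add(v); frontier.append(v)
            else:
                pending[v] -= 1
                if pending[v] == 0:
                    attr.add(v); frontier.append(v)
    return attr

def solve_weak_parity(V, owner, succ, color):
    """Winning regions for weak parity (player 0 wants an even maximal color).
    Peel off the attractor to the current maximal color d for the player of
    d's parity: visiting a d-colored state fixes the play's maximum to d."""
    win = {0: set(), 1: set()}
    V = set(V)
    succ = {v: set(succ[v]) for v in V}
    while V:
        d = max(color[v] for v in V)
        p = d % 2                            # player winning with maximal color d
        target = {v for v in V if color[v] == d}
        sub = {v: succ[v] & V for v in V}
        A = attractor(V, owner, sub, target, p)
        win[p] |= A
        V -= A
    return win

# Example: v2 carries the maximal color 2 (even), so player 0 wins wherever it
# can force a visit to v2; elsewhere, player 1 confines the play to odd colors.
owner = {"v0": 0, "v1": 1, "v2": 0, "v3": 1}
succ = {"v0": {"v1", "v2"}, "v1": {"v0", "v3"}, "v2": {"v2"}, "v3": {"v3"}}
color = {"v0": 0, "v1": 1, "v2": 2, "v3": 1}
regions = solve_weak_parity(set(owner), owner, succ, color)
# regions[0] == {"v0", "v2"}, regions[1] == {"v1", "v3"}
```

The strategies read off this computation (move toward the attractor's target, or stay out of it) depend only on the current state, matching the memoryless sufficiency proven above.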
Interestingly, memory requirements of W wp are larger in stochastic games, but pure AIFM strategies still suffice and we can therefore still apply our results. An example of a one-player stochastic arena that requires memory is provided in Figure 3. Intuitively, in this case, memory is necessary for correct risk-assessment: it may sometimes be needed to attempt to get a greater color with a smaller probability, and that depends on the current maximal color. In this example, keeping in memory the greatest color seen is sufficient to play optimally.

(Figure: states s 1 , s 2 ; actions a and b ; probabilistic transitions labeled c | p .)

Fig. 3.
Initialized one-player stochastic arena that requires memory for winning condition W wp . P 1 can maximize its winning probability by taking the risk of playing action b only if color 1 has been seen before. Notation “ c | p ” next to a transition means that color c is seen, and that this transition is taken with probability p .
We generalize this idea and prove that the memory skeleton M max = ( N , 0 , ( m, n ) ↦ max { m, n } ) suffices to play optimally in all stochastic arenas for both players (as argued earlier, although this skeleton is infinite, it is finite as soon as we restrict it to a finite set of colors).
We prove that the class A S P 1 of all initialized one-player stochastic arenas of P 1 admits pure memoryless G -optimal strategies based on M max by proving G - S - M max -monotony and G - S - M max -selectivity.
The weak parity winning condition is not prefix-independent, but using the definition of M max , we prove the following related property: for all m ∈ N , for all finite words w 1 , w 2 ∈ L m init ,m , for all infinite words w ∈ C ω ,
w 1 w ∈ W wp ⇐⇒ w 2 w ∈ W wp . (15)
That is, similar prefixes (in the sense that they reach the same state of the memory skeleton) have the same influence on the outcome; the winning continuations are the same. Let m ∈ N , w 1 , w 2 ∈ L m init ,m , and w = c 0 c 1 . . . ∈ C ω . Assume w 1 w ∈ W wp . Let n w = max j ≥ 0 c j . If m ≥ n w , then it means m is even and w 2 w is therefore also winning. If m < n w , then it means that n w is even and w 2 w is also in W wp .
Property (15) implies the following for distributions: for µ ∈ Dist ( C ω , F ), for all w 1 , w 2 ∈ L m init ,m , w 1 µ ( W wp ) = w 2 µ ( W wp ). This implies G - S - M max -monotony: let m ∈ N and ( A 1 , s 1 ) , ( A 2 , s 2 ) ∈ A S P 1 ; assume that for some w ∈ L m init ,m , we have w.l.o.g. w [ A 1 ] G s 1 ⊑ w [ A 2 ] G s 2 . Then we automatically have that for all w ′ ∈ L m init ,m , w ′ [ A 1 ] G s 1 ⊑ w ′ [ A 2 ] G s 2 , which proves G - S - M max -monotony.
We now turn to G - S - M max -selectivity.
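Property (15) can also be sanity-checked concretely: the state of M max reached after a prefix is just the maximal color seen so far, and (for continuations whose maximal color exists) the weak parity verdict depends only on that state. A short sketch with our own function names:

```python
def skeleton_update(m, c):
    """Transition function of M_max: remember the maximal color seen."""
    return max(m, c)

def memory_after(prefix, m_init=0):
    """Memory state of M_max after reading a finite prefix of colors."""
    state = m_init
    for c in prefix:
        state = skeleton_update(state, c)
    return state

def wins(prefix, continuation_max):
    """Weak parity verdict of prefix . w, for any continuation w whose
    maximal color exists and equals continuation_max."""
    return max(memory_after(prefix), continuation_max) % 2 == 0

# Two prefixes reaching the same memory state (here 3) have exactly the
# same winning continuations, as property (15) states.
```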
Let m ∈ N and ( A 1 , s 1 ) , ( A 2 , s 2 ) ∈ A S P 1 such that for i ∈ { 1 , 2 } , c col ( Hists ( A i , s i , s i )) ⊆ L m,m . Let w ∈ L m init ,m . Let A be the arena ( A 1 , s 1 ) ⊔ ( A 2 , s 2 ) with merged state t .
Thanks to the structure of the memory skeleton, we can make the following key observation: any play in Plays ( A , t ) that visits t infinitely many times has a maximal color exactly m ; indeed, m is a color appearing in w , and if a color greater than m is seen, the memory state cannot go back down to m , so t cannot be visited again (it would contradict that every history from t to t is in L m,m ).
Let σ ∈ Σ G ( A , t ). Our goal is to show that it is possible to do at least as well as w Pc σ A ,t [ W wp ] without the need to use actions both in A 1 and in A 2 at t .
We first assume that m is even: visiting t infinitely often is therefore winning for P 1 . If there is a strategy that, from t , comes back to t with probability 1, then P 1 can achieve the objective with probability 1 by repeatedly going back to t . The use of randomization at t is not necessary for this strategy: since it goes back to t with probability 1, every action it may play allows going back to t with probability 1. Thus, such a strategy does not need to use actions both in A 1 and in A 2 , as every time it leaves t , it can play the same action and repeat the strategy until it reaches t again.
The G - S - M max -selectivity is therefore satisfied, as w [ A 1 ] G s 1 or w [ A 2 ] G s 2 contains the distribution of a strategy that wins with probability 1, which is at least as good as w Pc σ A ,t [ W wp ].
Assume now that m is odd or that there is no strategy that comes back to t with probability 1. In the latter case, the probability to go back to t from t is less than 1 − ε for some ε > 0, so visiting t infinitely often necessarily has probability 0 for all strategies.
We condition w Pc σ A ,t [ W wp ] over which part of the arena the play ends in: either it visits t infinitely often (event □◇ t ), or it sticks to A 1 or A 2 without visiting t from some point on (events ◇□ ( A 1 \ t ) and ◇□ ( A 2 \ t )). We have
w Pc σ A ,t [ W wp ] = P σ A ,t [ □◇ t ] · w Pc σ A ,t [ W wp | □◇ t ] + Σ i ∈{ 1 , 2 } P σ A ,t [ ◇□ ( A i \ t )] · w Pc σ A ,t [ W wp | ◇□ ( A i \ t )] .
If m is odd, all infinite plays in □◇ t are losing; if all strategies visit t infinitely often with probability 0, then P σ A ,t [ □◇ t ] = 0: in any case, the first term is 0.
We focus on the last two terms. For i ∈ { 1 , 2 } , if the play stays in A i \ t from some point onward, as the value is independent from the actual prefix before the last visit to t (by property (15)), it means that it is possible to reach the same value while never going to A 3 − i . That is, there exists a strategy σ i ∈ Σ G ( A i , t ) such that w Pc σ A ,t [ W wp | ◇□ ( A i \ t )] ≤ w Pc σ i A i ,t [ W wp ] . We do not prove it formally; a very similar argument can be found in the proof of [Gim07, Theorem 4]: intuitively, it builds a strategy σ i that induces a distribution on the projection of the plays of ( A , t ) obtained by σ to plays of ( A i , s i ) (by removing the cycles on t in ( A 3 − i , s 3 − i )).
Thus, if we play the best strategy among σ 1 , which obtains a value at least as good as the part that ends in A 1 \ t , and σ 2 , which obtains a value at least as good as the part that ends in A 2 \ t , what we obtain is at least as good as the value obtained by σ , without needing to consider actions both in A 1 and in A 2 .
We have proven that W wp is G - S - M max -monotone and G - S - M max -selective; by Theorem 32, this implies that pure strategies based on M max are sufficient to play G -optimally in one-player stochastic arenas of P 1 . The same arguments with the same memory skeleton hold from the point of view of P 2 . As we have shown that both players' one-player arenas admit pure G -optimal strategies based on M max , by Theorem 23, we conclude that both players have pure G -optimal (even G -SP) strategies based on M max ⊗ M max (which corresponds to M max ) in all two-player stochastic arenas.

7 Conclusion

We have studied stochastic games and given an overview of desirable properties of preference relations that admit pure arena-independent finite-memory optimal strategies. Our analysis provides general tools to help study memory requirements in stochastic games, both with one player (Markov decision processes) and two players, and links both problems. It generalizes both work on deterministic games [GZ05, BLO +
20] and work on stochastic games [GZ09].
A natural question that remains unsolved is the link between memory requirements of a preference relation in deterministic and in stochastic games; our results can be applied independently to study both problems, but they do not yet describe a bridge from one to the other.
References
AR17. Benjamin Aminof and Sasha Rubin. First-cycle games.
Inf. Comput. , 254:195–216, 2017.BBE10. Tom´as Br´azdil, V´aclav Brozek, and Kousha Etessami. One-counter stochastic games. In KamalLodaya and Meena Mahajan, editors,
IARCS Annual Conference on Foundations of Software Tech-nology and Theoretical Computer Science, FSTTCS 2010, December 15-18, 2010, Chennai, India ,volume 8 of
LIPIcs , pages 108–119. Schloss Dagstuhl - Leibniz-Zentrum f¨ur Informatik, 2010.BDOR20. Thomas Brihaye, Florent Delgrange, Youssouf Oualhadj, and Mickael Randour. Life is random, timeis not: Markov decision processes with window objectives.
Log. Methods Comput. Sci. , 16(4), 2020.BFMM11. Alessandro Bianco, Marco Faella, Fabio Mogavero, and Aniello Murano. Exploring the boundaryof half-positionality.
Ann. Math. Artif. Intell. , 62(1-2):55–77, 2011.BFRR17. V´eronique Bruy`ere, Emmanuel Filiot, Mickael Randour, and Jean-Fran¸cois Raskin. Meet yourexpectations with guarantees: Beyond worst-case synthesis in quantitative games.
Inf. Comput. ,254:259–295, 2017.BHM +
17. Patricia Bouyer, Piotr Hofman, Nicolas Markey, Mickael Randour, and Martin Zimmermann. Bound-ing average-energy games. In Javier Esparza and Andrzej S. Murawski, editors,
Foundations of Soft-ware Science and Computation Structures - 20th International Conference, FOSSACS 2017, Held asPart of the European Joint Conferences on Theory and Practice of Software, ETAPS 2017, Uppsala,Sweden, April 22-29, 2017, Proceedings , volume 10203 of
Lecture Notes in Computer Science, pages 179–195, 2017. BHRR19. Véronique Bruyère, Quentin Hautem, Mickael Randour, and Jean-François Raskin. Energy mean-payoff games. In Wan Fokkink and Rob van Glabbeek, editors, 30th International Conference on Concurrency Theory, CONCUR 2019, volume 140 of
LIPIcs , pages 21:1–21:17. Schloss Dagstuhl - Leibniz-Zentrum f¨ur Informatik, 2019.BK08. Christel Baier and Joost-Pieter Katoen.
Principles of model checking . MIT Press, 2008.BLO +
20. Patricia Bouyer, St´ephane Le Roux, Youssouf Oualhadj, Mickael Randour, and Pierre Vandenhove.Games where you can play optimally with arena-independent finite memory. In Konnov and Kov´acs[KK20], pages 24:1–24:22.BMR +
18. Patricia Bouyer, Nicolas Markey, Mickael Randour, Kim G. Larsen, and Simon Laursen. Average-energy games.
Acta Informatica, 55(2):91–127, 2018. BRR17. Raphaël Berthon, Mickael Randour, and Jean-François Raskin. Threshold constraints with guarantees for parity objectives in Markov decision processes. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, volume 80 of
LIPIcs , pages121:1–121:15. Schloss Dagstuhl - Leibniz-Zentrum f¨ur Informatik, 2017.CD12. Krishnendu Chatterjee and Laurent Doyen. Energy parity games.
Theor. Comput. Sci., 458:49–60, 2012. CD16. Krishnendu Chatterjee and Laurent Doyen. Perfect-information stochastic games with generalized mean-payoff objectives. In Martin Grohe, Eric Koskinen, and Natarajan Shankar, editors, Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, LICS ’16, New York, NY, USA, July 5-8, 2016, pages 247–256. ACM, 2016. CdAH04. Krishnendu Chatterjee, Luca de Alfaro, and Thomas A. Henzinger. Trading memory for randomness. In First International Conference on the Quantitative Evaluation of Systems (QEST 2004), pages 206–217. IEEE Computer Society, 2004. CDGH10. Krishnendu Chatterjee, Laurent Doyen, Hugo Gimbert, and Thomas A. Henzinger. Randomness for free. In Petr Hliněný and Antonín Kučera, editors, Mathematical Foundations of Computer Science 2010, 35th International Symposium, MFCS 2010, Brno, Czech Republic, August 23-27, 2010. Proceedings, volume 6281 of
Lecture Notes in Computer Science , pages 246–257. Springer,2010.CFK +
12. Taolue Chen, Vojtech Forejt, Marta Z. Kwiatkowska, Aistis Simaitis, Ashutosh Trivedi, and MichaelUmmels. Playing stochastic games precisely. In Maciej Koutny and Irek Ulidowski, editors,
CON-CUR 2012 - Concurrency Theory - 23rd International Conference, CONCUR 2012, Newcastle uponTyne, UK, September 4-7, 2012. Proceedings , volume 7454 of
Lecture Notes in Computer Science ,pages 348–363. Springer, 2012.CFK +
13. Taolue Chen, Vojtech Forejt, Marta Z. Kwiatkowska, Aistis Simaitis, and Clemens Wiltsche. Onstochastic games with multiple objectives. In Krishnendu Chatterjee and Jir´ı Sgall, editors,
Math-ematical Foundations of Computer Science 2013 - 38th International Symposium, MFCS 2013,Klosterneuburg, Austria, August 26-30, 2013. Proceedings , volume 8087 of
Lecture Notes in Com-puter Science , pages 266–277. Springer, 2013.CH12. Krishnendu Chatterjee and Thomas A. Henzinger. A survey of stochastic ω -regular games. J.Comput. Syst. Sci. , 78(2):394–413, 2012.CJH04. Krishnendu Chatterjee, Marcin Jurdzinski, and Thomas A. Henzinger. Quantitative stochasticparity games. In J. Ian Munro, editor,
Proceedings of the Fifteenth Annual ACM-SIAM Symposiumon Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004 , pages121–130. SIAM, 2004.CKK17. Krishnendu Chatterjee, Zuzana Kret´ınsk´a, and Jan Kret´ınsk´y. Unifying two views on multiplemean-payoff objectives in Markov decision processes.
Log. Methods Comput. Sci. , 13(2), 2017.CKWW20. Krishnendu Chatterjee, Joost-Pieter Katoen, Maximilian Weininger, and Tobias Winkler. Stochasticgames with lexicographic reachability-safety objectives. In Shuvendu K. Lahiri and Chao Wang,editors,
Computer Aided Verification - 32nd International Conference, CAV 2020, Los Angeles, CA,USA, July 21-24, 2020, Proceedings, Part II , volume 12225 of
Lecture Notes in Computer Science ,pages 398–420. Springer, 2020.Con92. Anne Condon. The complexity of stochastic games.
Inf. Comput., 96(2):203–224, 1992. CP19. Krishnendu Chatterjee and Nir Piterman. Combinations of qualitative winning for stochastic parity games. In Wan J. Fokkink and Rob van Glabbeek, editors, 30th International Conference on Concurrency Theory, CONCUR 2019, volume 140 of
LIPIcs , pages 6:1–6:17. Schloss Dagstuhl - Leibniz-Zentrum f¨ur Informatik, 2019.CRR14. Krishnendu Chatterjee, Mickael Randour, and Jean-Fran¸cois Raskin. Strategy synthesis for multi-dimensional quantitative objectives.
Acta Inf. , 51(3-4):129–163, 2014.DKQR20. Florent Delgrange, Joost-Pieter Katoen, Tim Quatmann, and Mickael Randour. Simple strategiesin multi-objective MDPs. In Armin Biere and David Parker, editors,
Tools and Algorithms forthe Construction and Analysis of Systems - 26th International Conference, TACAS 2020, Held asPart of the European Joint Conferences on Theory and Practice of Software, ETAPS 2020, Dublin,Ireland, April 25-30, 2020, Proceedings, Part I , volume 12078 of
Lecture Notes in Computer Science ,pages 346–364. Springer, 2020.Dur19. Rick Durrett.
Probability: Theory and Examples . Cambridge Series in Statistical and ProbabilisticMathematics. Cambridge University Press, 5th edition, 2019.Gim07. Hugo Gimbert. Pure stationary optimal strategies in Markov decision processes. In WolfgangThomas and Pascal Weil, editors,
STACS 2007, 24th Annual Symposium on Theoretical Aspects ofComputer Science, Aachen, Germany, February 22-24, 2007, Proceedings , volume 4393 of
LectureNotes in Computer Science , pages 200–211. Springer, 2007.GK14. Hugo Gimbert and Edon Kelmendi. Two-player perfect-information shift-invariant submixingstochastic games are half-positional.
CoRR , abs/1401.6575, 2014.GZ05. Hugo Gimbert and Wieslaw Zielonka. Games where you can play optimally without any memory. InMart´ın Abadi and Luca de Alfaro, editors,
CONCUR 2005 - Concurrency Theory, 16th InternationalConference, CONCUR 2005, San Francisco, CA, USA, August 23-26, 2005, Proceedings , volume3653 of
Lecture Notes in Computer Science, pages 428–442. Springer, 2005. GZ09. Hugo Gimbert and Wieslaw Zielonka. Pure and stationary optimal strategies in perfect-information stochastic games with global preferences. Unpublished, 2009. KK20. Igor Konnov and Laura Kovács, editors. 31st International Conference on Concurrency Theory, CONCUR 2020, volume 171 of
LIPIcs .Schloss Dagstuhl - Leibniz-Zentrum f¨ur Informatik, 2020.Kop06. Eryk Kopczy´nski. Half-positional determinacy of infinite games. In Michele Bugliesi, Bart Preneel,Vladimiro Sassone, and Ingo Wegener, editors,
Automata, Languages and Programming, 33rd In-ternational Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II , volume4052 of
Lecture Notes in Computer Science , pages 336–347. Springer, 2006.Kop08. Eryk Kopczy´nski.
Half-positional Determinacy of Infinite Games . PhD thesis, Warsaw University,2008.LP18. St´ephane Le Roux and Arno Pauly. Extending finite-memory determinacy to multi-player games.
Inf. Comput., 261(Part):676–694, 2018. LPR18. Stéphane Le Roux, Arno Pauly, and Mickael Randour. Extending finite-memory determinacy by boolean combination of winning conditions. In Sumit Ganguly and Paritosh K. Pandya, editors, 38th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2018, volume 122 of
LIPIcs , pages38:1–38:20. Schloss Dagstuhl - Leibniz-Zentrum f¨ur Informatik, 2018.Mar75. Donald A. Martin. Borel determinacy.
Annals of Mathematics, 102(2):363–371, 1975. MPR20. Benjamin Monmege, Julie Parreaux, and Pierre-Alain Reynier. Reaching your goal optimally by playing at random with no memory. In Konnov and Kovács [KK20], pages 26:1–26:21. MSTW17. Richard Mayr, Sven Schewe, Patrick Totzke, and Dominik Wojtczak. MDPs with energy-parity objectives. In 32nd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2017, pages 1–12. IEEE Computer Society, 2017. MSTW21. Richard Mayr, Sven Schewe, Patrick Totzke, and Dominik Wojtczak. Simple stochastic games with almost-sure energy-parity objectives are in NP and coNP.
CoRR , abs/2101.06989, 2021.Put94. Martin L. Puterman.
Markov Decision Processes: Discrete Stochastic Dynamic Programming . WileySeries in Probability and Statistics. Wiley, 1994.Ran13. Mickael Randour. Automated synthesis of reliable and efficient systems through game theory: A casestudy. In
Proc. of ECCS 2012 , Springer Proceedings in Complexity XVII, pages 731–738. Springer,2013.RRS15. Mickael Randour, Jean-Fran¸cois Raskin, and Ocan Sankur. Variations on the stochastic shortestpath problem. In Deepak D’Souza, Akash Lal, and Kim Guldstrand Larsen, editors,
Verification,Model Checking, and Abstract Interpretation - 16th International Conference, VMCAI 2015, Mum-bai, India, January 12-14, 2015. Proceedings , volume 8931 of
Lecture Notes in Computer Science ,pages 1–18. Springer, 2015.RRS17. Mickael Randour, Jean-Fran¸cois Raskin, and Ocan Sankur. Percentile queries in multi-dimensionalMarkov decision processes.
Formal Methods Syst. Des. , 50(2-3):207–248, 2017.Sha53. L. S. Shapley. Stochastic games.
Proceedings of the National Academy of Sciences , 39(10):1095–1100,1953.Tho08. Wolfgang Thomas. Church’s problem and a tour through automata theory. In Arnon Avron, NachumDershowitz, and Alexander Rabinovich, editors,
Pillars of Computer Science, Essays Dedicated toBoris (Boaz) Trakhtenbrot on the Occasion of His 85th Birthday , volume 4800 of
Lecture Notes inComputer Science , pages 635–655. Springer, 2008.VCD +
15. Yaron Velner, Krishnendu Chatterjee, Laurent Doyen, Thomas A. Henzinger, Alexander MosheRabinovich, and Jean-Fran¸cois Raskin. The complexity of multi-mean-payoff and multi-energygames.
Inf. Comput. , 241:177–196, 2015.
A Results on splits
We recall here technical results about split arenas (Definition 26) that are already present in [GZ09], with the slight difference that we consider initialized arenas.
Let ( A = ( S 1 , S 2 , A, δ, col ) , S init ) be an initialized arena, t ∈ S 1 be a state controlled by P 1 , and ( Split ( A , t ) , S t init ) be the split of ( A , S init ) on t .
For all a ∈ A ( t ), we build a natural bijection between plays of the arena and plays of its split: it is a function φ a : Hists ( A , S init ) → Hists ( Split ( A , t ) , S a init ) . The action labels are chosen as follows: if t has never been visited, it picks action a by default; if t has been visited, it picks the last action played in t . Formally, let ρ = s 0 a 1 s 1 . . . a n s n ∈ Hists ( A , S init ). We define φ a ( ρ ) = s a ′ 0 0 a 1 s a ′ 1 1 . . . a n s a ′ n n where we assume as usual that t a ′ i = t , and for i , with 0 ≤ i ≤ n such that s i ≠ t ,
a ′ i = a if for all j < i , s j ≠ t ,
a ′ i = a k +1 if k is the index of the visit to t preceding s i in ρ .
The history φ a ( ρ ) is a history of the split by construction. Function φ a has an inverse ( φ a ) − 1 which associates to any history of the split starting in S a init the same history in which all the action labels have been removed. We can extend function φ a to a bijection on plays: let φ a ∞ : Plays ( A , S init ) → Plays ( Split ( A , t ) , S a init ) be the function such that for π = s 0 a 1 s 1 a 2 s 2 . . . ∈ Plays ( A , S init ), if ρ n = s 0 a 1 s 1 . . . a n s n is a prefix of π , then φ a ∞ ( π ) = lim n →∞ φ a ( ρ n ).
For i ∈ { 1 , 2 } , for all a ∈ A ( t ), we build a natural bijection between strategies of the arena and strategies of its split: it is a function Φ a i : Σ G i ( A , S init ) → Σ G i ( Split ( A , t ) , S a init ) . For σ i ∈ Σ G i ( A , S init ) a strategy of P i , we define Φ a i ( σ i ) = σ i ◦ ( φ a ) − 1 .
Similarly, this function has an inverse: for σ i ∈ Σ G i ( Split ( A , t ) , S a init ), we define ( Φ a i ) − 1 ( σ i ) = σ i ◦ φ a . We prove a few results about these bijections, which correspond to [GZ09, Proposition 10 and Lemma 12]. Lemma 35.
Let a ∈ A ( t ) , s ∈ S init , σ 1 ∈ Σ G 1 ( A , S init ) , and σ 2 ∈ Σ G 2 ( A , S init ) . We have Pc σ 1 ,σ 2 A ,s = Pc Φ a 1 ( σ 1 ) , Φ a 2 ( σ 2 ) Split ( A ,t ) ,s a . Let ⊑ be a preference relation and X ∈ { PFM , P , GFM , G } be a type of strategies. For i ∈ { 1 , 2 } , strategy σ i is pure (resp. finite-memory) if and only if strategy Φ a i ( σ i ) is pure (resp. finite-memory). For i ∈ { 1 , 2 } , σ i is X -optimal in ( A , S init , ⊑ ) if and only if Φ a i ( σ i ) is X -optimal in ( Split ( A , t ) , S a init , ⊑ ) . Moreover, ( σ 1 , σ 2 ) is an X -NE in ( A , S init , ⊑ ) if and only if ( Φ a 1 ( σ 1 ) , Φ a 2 ( σ 2 )) is an X -NE in ( Split ( A , t ) , S a init , ⊑ ) . Proof. We prove the first claim by showing the equality of two distributions in
Dist ( Plays ( A , s ) , F ( A ,s ) ):
P σ 1 ,σ 2 A ,s [ · ] = P Φ a 1 ( σ 1 ) , Φ a 2 ( σ 2 ) Split ( A ,t ) ,s a [ φ a ∞ ( · )] . (16)
We prove the equality for cylinders Cyl ( ρ ) with ρ ∈ Hists ( A , S init ). Notice that φ a ∞ ( Cyl ( ρ )) = Cyl ( φ a ( ρ )). Thanks to our construction of functions Φ a 1 and Φ a 2 , an easy induction on the length of ρ shows that P σ 1 ,σ 2 A ,s [ Cyl ( ρ )] = P Φ a 1 ( σ 1 ) , Φ a 2 ( σ 2 ) Split ( A ,t ) ,s a [ Cyl ( φ a ( ρ ))] . Since cylinders generate the σ -algebra, this proves (16).
Now notice that the bijection on plays φ a ∞ preserves the sequence of colors seen, i.e., for all π ∈ Plays ( A , S init ), c col ( π ) = c col ( φ a ∞ ( π )). Using the definition of Pc , we can therefore conclude that Pc σ 1 ,σ 2 A ,s = Pc Φ a 1 ( σ 1 ) , Φ a 2 ( σ 2 ) Split ( A ,t ) ,s a .
The claims about X -optimality and X -NE follow from the first one: as Φ a i is a bijection that preserves the induced distributions on colors, it also preserves X -optimality and X -NE. Bijection Φ a i clearly preserves the “pure” feature of strategies by construction, in both directions. Now if σ i is finite-memory, then Φ a i ( σ i ) does not need any more memory to play as it has access to the same information and the last action played in t . In the other direction, if Φ a i ( σ i ) is finite-memory, then σ i can play with the same memory plus extra information about the last action that was played in t . Therefore, σ i might need more memory than Φ a i ( σ i ), but that memory stays finite. ⊓⊔ Lemma 36.
Let $\sqsubseteq$ be a preference relation and let $\mathsf{X} \in \{\mathsf{PFM}, \mathsf{P}, \mathsf{GFM}, \mathsf{G}\}$ be a type of strategies. Let $(\mathcal{A} = (S_1, S_2, A, \delta, \mathsf{col}), S_{\mathit{init}})$ be an initialized arena, and let $(\mathsf{Split}(\mathcal{A}, t), S^t_{\mathit{init}})$ be its split on $t$ for some $t \in S_1$. Assume $(\sigma_1, \sigma_2)$ is an $\mathsf{X}$-NE in $(\mathsf{Split}(\mathcal{A}, t), S^{a^*}_{\mathit{init}})$ for some $a^* \in A(t)$. If $\sigma_1$ is pure memoryless and $\sigma_1(t) = a^*$, then there exists an $\mathsf{X}$-NE $(\sigma'_1, \sigma'_2)$ in $(\mathcal{A}, S_{\mathit{init}})$ such that $\sigma'_1$ is pure memoryless.

Proof. We set $(\sigma'_1, \sigma'_2) = ((\Phi^{a^*}_1)^{-1}(\sigma_1), (\Phi^{a^*}_2)^{-1}(\sigma_2))$, which is an $\mathsf{X}$-NE in $(\mathcal{A}, S_{\mathit{init}})$ by Lemma 35. Moreover, Lemma 35 shows that $\sigma'_1$ is pure. It is left to prove that in this particular case, $\sigma'_1$ is memoryless. Let $\rho \in \mathsf{Hists}(\mathcal{A}, S_{\mathit{init}})$ be a history consistent with $\sigma'_1$. We know that if $\mathsf{out}(\rho) = t$, then $\sigma'_1(\rho) = a^*$, because this is the only possible action played in $t$ by $\sigma_1$ in $\mathsf{Split}(\mathcal{A}, t)$. Now assume $\mathsf{out}(\rho) = s \ne t$. Notice that since any action taken at $t$ is necessarily $a^*$ (and since we start in $S^{a^*}_{\mathit{init}}$), any state appearing along $\varphi^{a^*}(\rho)$ (except $t$) is necessarily labeled by $a^*$. Therefore, as $\sigma_1$ is memoryless, we have $\sigma'_1(\rho) = \sigma_1(\varphi^{a^*}(\rho)) = \sigma_1(s^{a^*})$, which only depends on $s$. ⊓⊔

Lemma 37.
Let $\sqsubseteq$ be a preference relation. Let $(\mathcal{A} = (S_1, S_2, A, \delta, \mathsf{col}), S_{\mathit{init}})$ be an initialized arena, and let $(\mathsf{Split}(\mathcal{A}, t), S^t_{\mathit{init}})$ be its split on $t$ for some $t \in S_1$. Let $\sigma_1 \in \Sigma^{\mathsf{PFM}}_1(\mathsf{Split}(\mathcal{A}, t), S^t_{\mathit{init}})$ and $\sigma_2 \in \Sigma^{\mathsf{G}}_2(\mathsf{Split}(\mathcal{A}, t), S^t_{\mathit{init}})$. Assume $\sigma_1$ is pure memoryless, and let $a^* = \sigma_1(t)$. Let $(\mathcal{A}^{a^*}, S^{a^*}_{\mathit{init}})$ be the initialized subarena of $(\mathcal{A}, S_{\mathit{init}})$ in which only action $a^*$ is available in $t$ and states are renamed $s \mapsto s^{a^*}$. Let $\sigma^{a^*}_1$ and $\sigma^{a^*}_2$ be the restrictions of $\sigma_1$ and $\sigma_2$ to $\mathsf{Hists}(\mathcal{A}^{a^*}, S^{a^*}_{\mathit{init}})$, which are strategies on $(\mathcal{A}^{a^*}, S^{a^*}_{\mathit{init}})$. Then for all $s^{a^*} \in S^{a^*}_{\mathit{init}}$, $\mathsf{Pc}^{\sigma_1,\sigma_2}_{\mathsf{Split}(\mathcal{A},t),s^{a^*}} = \mathsf{Pc}^{\sigma^{a^*}_1,\sigma^{a^*}_2}_{\mathcal{A}^{a^*},s^{a^*}}$.

Proof.
Let $s^{a^*} \in S^{a^*}_{\mathit{init}}$. Notice that any play in $\mathsf{Split}(\mathcal{A}, t)$ consistent with $\sigma_1$ and starting in $s^{a^*}$ only visits states among $S^{a^*}$, since it starts there and every action played in $t$ is $a^*$. This shows that $\sigma^{a^*}_1$ and $\sigma^{a^*}_2$ are indeed well-defined strategies on $(\mathcal{A}^{a^*}, S^{a^*}_{\mathit{init}})$. Moreover, it shows that $\mathsf{Pc}^{\sigma_1,\sigma_2}_{\mathsf{Split}(\mathcal{A},t),s^{a^*}} = \mathsf{Pc}^{\sigma^{a^*}_1,\sigma^{a^*}_2}_{\mathcal{A}^{a^*},s^{a^*}}$, since every infinite play stays in the subarena of $\mathsf{Split}(\mathcal{A}, t)$ corresponding to $\mathcal{A}^{a^*}$. ⊓⊔

B Missing proofs of Section 5
We prove the two technical lemmas from Section 5 that were stated without proof.
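As a quick sanity check before the proofs, the word-level identity underlying Lemma 33, namely $\widehat{\mathsf{col}}(\rho\pi) = \widehat{\mathsf{col}}(\rho)\,\widehat{\mathsf{col}}(\pi)$, can be tested on finite prefixes. The Python sketch below uses a toy encoding of our own (histories as lists of states, colors assumed for concreteness to label states, and the shared pivot state between $\rho$ and $\pi$ ignored); it illustrates the identity that the lemma lifts to distributions, and is not the paper's formalism.

```python
def col_hat(states, col):
    """Color projection: map a sequence of states to its color word.
    (Toy model: colors label states; actions are dropped by the
    projection either way.)"""
    return [col[s] for s in states]

# A toy coloring, a finite history rho, and a continuation pi.
col = {"s0": "blue", "s1": "red", "t": "green"}
rho = ["s0", "t"]
pi = ["s1", "s1", "t"]

# Projecting the concatenation equals concatenating the projections:
# the finite analogue of col_hat(rho * mu) = col_hat(rho) col_hat(mu).
print(col_hat(rho + pi, col) == col_hat(rho, col) + col_hat(pi, col))  # True
```

Lemma 33 states exactly this compatibility, but for a history followed by a *distribution* over plays rather than a single continuation.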
Lemma 33.
Let $(\mathcal{A} = (S_1, S_2, A, \delta, \mathsf{col}), s)$ be an initialized one-player arena and $\rho \in \mathsf{Hists}(\mathcal{A}, s)$. Let $\mu$ be a distribution on plays in $\mathsf{Dist}(\mathsf{Plays}(\mathcal{A}, \mathsf{out}(\rho)), \mathcal{F}_{(\mathcal{A},\mathsf{out}(\rho))})$. We have $\widehat{\mathsf{col}}(\rho\mu) = \widehat{\mathsf{col}}(\rho)\,\widehat{\mathsf{col}}(\mu)$.

(Footnote: formally, our memory model is based on colors, and not on actions. However, we can easily enrich the game graph with a new color for each action available at $t$, and a memory skeleton, which reads colors, can then remember the last action played at $t$.)

Proof. Let $E \in \mathcal{F}_C$ be an event about infinite sequences of colors. Then,
\begin{align*}
\widehat{\mathsf{col}}(\rho\mu)(E) &= \rho\mu(\widehat{\mathsf{col}}^{-1}(E)) && \text{by definition of } \widehat{\mathsf{col}} \text{ on distributions}\\
&= \mu(\{\pi \in \mathsf{Plays}(\mathcal{A}, \mathsf{out}(\rho)) \mid \rho\pi \in \widehat{\mathsf{col}}^{-1}(E)\}) && \text{by definition of } \rho\mu\\
&= \mu(\{\pi \in \mathsf{Plays}(\mathcal{A}, \mathsf{out}(\rho)) \mid \widehat{\mathsf{col}}(\rho\pi) \in E\})\\
&= \mu(\{\pi \in \mathsf{Plays}(\mathcal{A}, \mathsf{out}(\rho)) \mid \widehat{\mathsf{col}}(\rho)\,\widehat{\mathsf{col}}(\pi) \in E\})\\
&= (\mu \circ \widehat{\mathsf{col}}^{-1})(\{w' \in C^\omega \mid \widehat{\mathsf{col}}(\rho)\,w' \in E\})\\
&= \widehat{\mathsf{col}}(\mu)(\{w' \in C^\omega \mid \widehat{\mathsf{col}}(\rho)\,w' \in E\}) && \text{by definition of } \widehat{\mathsf{col}} \text{ on distributions}\\
&= (\widehat{\mathsf{col}}(\rho)\,\widehat{\mathsf{col}}(\mu))(E) && \text{by definition of } \widehat{\mathsf{col}}(\rho)\,\widehat{\mathsf{col}}(\mu).
\end{align*}
⊓⊔

Lemma 34.
Let $(\mathcal{A}, s)$ be an initialized one-player arena and let $\sigma, \tau \in \Sigma^{\mathsf{G}}(\mathcal{A}, s)$ be two strategies. Let $\rho = s_0 a_1 s_1 \ldots a_n s_n \in \mathsf{Hists}(\mathcal{A}, s)$. We say that $\sigma$ coincides with $\tau$ on $\rho$ if for each prefix $\rho_i = s_0 a_1 s_1 \ldots a_i s_i$ of $\rho$ with $0 \le i < n$, $\sigma(\rho_i) = \tau(\rho_i)$. If $\sigma$ coincides with $\tau$ on $\rho$, then $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\mathsf{Cyl}(\rho)] = \mathbb{P}^{\tau}_{\mathcal{A},s}[\mathsf{Cyl}(\rho)]$.

Let $t$ be a state of $\mathcal{A}$. We write $\lnot\Diamond t$ for the event in $\mathcal{F}_{(\mathcal{A},s)}$ that consists of all the infinite plays that never visit $t$. Assume that for all $\rho = s_0 a_1 s_1 \ldots a_n s_n \in \mathsf{Hists}(\mathcal{A}, s)$ such that $s_i \ne t$ for all $i$, $0 \le i < n$, $\sigma$ coincides with $\tau$ on $\rho$. Then $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\lnot\Diamond t] = \mathbb{P}^{\tau}_{\mathcal{A},s}[\lnot\Diamond t]$ and, if $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\lnot\Diamond t] > 0$, $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\cdot \mid \lnot\Diamond t] = \mathbb{P}^{\tau}_{\mathcal{A},s}[\cdot \mid \lnot\Diamond t]$.

Proof.
Assume $\mathcal{A} = (S_1, S_2, A, \delta, \mathsf{col})$. Using the definition of $\mathbb{P}^{\cdot}_{\mathcal{A},s}[\mathsf{Cyl}(\rho)]$ and the hypothesis, we have
\[
\mathbb{P}^{\sigma}_{\mathcal{A},s}[\mathsf{Cyl}(\rho)] = \prod_{i=1}^{n} \sigma(\rho_{i-1})(a_i) \cdot \delta(s_{i-1}, a_i, s_i) = \prod_{i=1}^{n} \tau(\rho_{i-1})(a_i) \cdot \delta(s_{i-1}, a_i, s_i) = \mathbb{P}^{\tau}_{\mathcal{A},s}[\mathsf{Cyl}(\rho)],
\]
which proves the first claim.

Let $\mathcal{H} = \{\mathsf{Cyl}(\rho) \mid \rho \in \mathsf{Hists}(\mathcal{A}, s) \text{ and } \sigma \text{ coincides with } \tau \text{ on } \rho\}$. By the first claim, $\mathbb{P}^{\sigma}_{\mathcal{A},s}$ and $\mathbb{P}^{\tau}_{\mathcal{A},s}$ are equal on $\mathcal{H}$. As this class is closed under intersection, by the monotone class lemma, $\mathbb{P}^{\sigma}_{\mathcal{A},s}$ and $\mathbb{P}^{\tau}_{\mathcal{A},s}$ are also equal on the $\sigma$-algebra generated by $\mathcal{H}$.

For an event $E \in \mathcal{F}_{(\mathcal{A},s)}$, we denote by $E^c = \mathsf{Plays}(\mathcal{A}, s) \setminus E$ its complement; notice that
\[
\lnot\Diamond t = \Big( \bigcup_{n \in \mathbb{N}} \mathsf{Cyl}(\underbrace{S\setminus\{t\}, A, \ldots, S\setminus\{t\}, A}_{n \text{ times } ``S\setminus\{t\}, A\text{''}}, t) \Big)^c = \bigcap_{n \in \mathbb{N}} \mathsf{Cyl}(\underbrace{S\setminus\{t\}, A, \ldots, S\setminus\{t\}, A}_{n \text{ times } ``S\setminus\{t\}, A\text{''}}, t)^c,
\]
where $\mathsf{Cyl}(S\setminus\{t\}, A, \ldots, S\setminus\{t\}, A, t)$ refers to the union over all cylinders of histories that go through a certain number of states that are not $t$, and then end in $t$. Event $\lnot\Diamond t$ can therefore be expressed with complements and countable intersections of cylinders in $\mathcal{H}$, so $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\lnot\Diamond t] = \mathbb{P}^{\tau}_{\mathcal{A},s}[\lnot\Diamond t]$, which proves the second claim.

We now assume that $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\lnot\Diamond t] >$
$0$. Using the definition of conditional probabilities, to prove that $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\cdot \mid \lnot\Diamond t] = \mathbb{P}^{\tau}_{\mathcal{A},s}[\cdot \mid \lnot\Diamond t]$, it is left to prove that for all events $E \in \mathcal{F}_{(\mathcal{A},s)}$, $\mathbb{P}^{\sigma}_{\mathcal{A},s}[E \cap \lnot\Diamond t] = \mathbb{P}^{\tau}_{\mathcal{A},s}[E \cap \lnot\Diamond t]$. We first prove this equality when $E$ is a cylinder $\mathsf{Cyl}(\rho)$ for some $\rho \in \mathsf{Hists}(\mathcal{A}, s)$. If $\rho$ visits $t$, then $E \cap \lnot\Diamond t = \emptyset$ and $\mathbb{P}^{\sigma}_{\mathcal{A},s}[E \cap \lnot\Diamond t] = \mathbb{P}^{\tau}_{\mathcal{A},s}[E \cap \lnot\Diamond t] = 0$. If $\rho$ does not visit $t$, then $E$ is an element of $\mathcal{H}$. Therefore, $E \cap \lnot\Diamond t$ can be expressed with complements and countable intersections of elements of $\mathcal{H}$, so $\mathbb{P}^{\sigma}_{\mathcal{A},s}[E \cap \lnot\Diamond t] = \mathbb{P}^{\tau}_{\mathcal{A},s}[E \cap \lnot\Diamond t]$. Since the measures $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\cdot \cap \lnot\Diamond t]$ and $\mathbb{P}^{\tau}_{\mathcal{A},s}[\cdot \cap \lnot\Diamond t]$ are equal on all cylinders, they are also equal on all the events in $\mathcal{F}_{(\mathcal{A},s)}$. ⊓⊔
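The product formula for cylinder probabilities used in the first claim of Lemma 34 can be made concrete. The following Python sketch is our own toy encoding (histories as alternating state/action lists, `sigma` mapping a history prefix to a distribution over actions, `delta` mapping a state-action pair to a distribution over successors); it computes $\mathbb{P}^{\sigma}_{\mathcal{A},s}[\mathsf{Cyl}(\rho)]$ for a one-player arena and checks that two strategies coinciding on the strict prefixes of $\rho$ assign it the same probability.

```python
def cylinder_prob(sigma, delta, rho):
    """P[Cyl(rho)] = prod_{i=1..n} sigma(rho_{i-1})(a_i) * delta(s_{i-1}, a_i)(s_i),
    where rho = [s0, a1, s1, ..., an, sn] alternates states and actions."""
    p = 1.0
    for i in range(1, len(rho), 2):      # i ranges over action positions
        prefix = tuple(rho[:i])          # the prefix history rho_{(i-1)/2}
        a, s_prev, s_next = rho[i], rho[i - 1], rho[i + 1]
        p *= sigma(prefix)[a] * delta[(s_prev, a)][s_next]
    return p

# Toy one-player arena: from s0, action x goes to s0 or t with probability 1/2 each.
delta = {("s0", "x"): {"s0": 0.5, "t": 0.5}}

# tau differs from sigma only on histories ending in t; such histories are not
# strict prefixes of rho below, so the two cylinder probabilities must agree.
sigma = lambda h: {"x": 0.9, "y": 0.1}
tau = lambda h: {"x": 1.0} if h[-1] == "t" else {"x": 0.9, "y": 0.1}

rho = ["s0", "x", "s0", "x", "t"]
print(cylinder_prob(sigma, delta, rho) == cylinder_prob(tau, delta, rho))  # True
```

This is exactly the situation in the second claim: the strategies may differ after visiting $t$, yet they induce the same probabilities on all cylinders of histories that avoid $t$ before their last state.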