Simple Stochastic Games with Almost-Sure Energy-Parity Objectives are in NP and coNP
Richard Mayr, Sven Schewe, Patrick Totzke, and Dominik Wojtczak
University of Edinburgh, Edinburgh, UK
University of Liverpool, Liverpool, UK
Abstract.
We study stochastic games with energy-parity objectives, which combine quantitative rewards with a qualitative ω-regular condition: the maximizer aims to avoid running out of energy while simultaneously satisfying a parity condition. We show that the corresponding almost-sure problem, i.e., checking whether there exists a maximizer strategy that achieves the energy-parity objective with probability 1 when starting at a given energy level k, is decidable and in NP ∩ coNP. The same holds for checking if such a k exists and if a given k is minimal.

Keywords:
Simple Stochastic Games, Parity Games, Energy Games
Simple stochastic games (SSGs), also called competitive Markov decision processes [30] or 2½-player games [23,22], are turn-based games of perfect information played on finite graphs. Each state is either random or belongs to one of the players (maximizer or minimizer). A game is played by successively moving a pebble along the game graph, where the next state is chosen by the player who owns the current one or, in the case of random states, according to a predefined distribution. This way, an infinite run is produced. The maximizer tries to achieve an objective (in our case almost surely), while the minimizer tries to prevent this. The maximizer can be seen as a controller trying to ensure an objective in the face of both known random failure modes (encoded by the random states) and an unknown or hostile environment (encoded by the minimizer player).

Stochastic games were first introduced in Shapley's seminal work [48] in 1953 and have since played a central role in the solution of many problems in computer science, including synthesis of reactive systems [46,42]; checking interface compatibility [27]; well-formedness of specifications [28]; verification of open systems [4]; and many others.

A huge variety of objectives for such games has already been studied in the literature. We will mainly focus on three of them in this paper: parity, mean-payoff, and energy objectives. In order to define them, we assume that numeric rewards are assigned to transitions, and priorities (encoded by bounded non-negative numbers) are assigned to states.

The parity objective simply asks that the minimal priority that appears infinitely often in a run is even. Such a condition is a canonical way to define desired behaviors of systems, such as safety, liveness, fairness, etc.; it subsumes all ω-regular objectives.
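For intuition, the parity condition is easy to evaluate on an eventually-periodic run, i.e., a finite prefix followed by a forever-repeated cycle: the priorities seen infinitely often are exactly those on the cycle. A minimal sketch (Python; the run encoding is ours, for illustration only):

```python
def satisfies_parity(prefix, cycle):
    """Parity condition on the eventually-periodic run prefix.cycle^omega:
    the minimal priority appearing infinitely often must be even.
    Priorities on the prefix occur only finitely often, so they are irrelevant."""
    return min(cycle) % 2 == 0

print(satisfies_parity([1, 3], [2, 4]))  # True: minimal infinite priority is 2
print(satisfies_parity([0], [1, 2]))     # False: minimal infinite priority is 1
```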
The algorithmic problem of deciding the winner in non-stochastic parity games is polynomial-time equivalent to the model checking of the modal µ-calculus [51] and is at the center of the algorithmic solutions to Church's synthesis problem [45]. But the impact of parity games goes well beyond automata theory and logic: they facilitated the solution of two long-standing open problems in stochastic planning [29] and in linear programming [32], which was done by careful adaptation of the parity game examples on which the strategy improvement algorithm [31] requires exponentially many iterations.

The parity objective can be seen as a special case of the mean-payoff objective, which asks for the limit average reward per transition along the run to be non-negative. Mean-payoff objectives are among the first objectives studied for stochastic games and go back to a 1957 paper by Gillette [33]. They allow for reasoning about the efficiency of a system, e.g., how fast it operates once optimally controlled.

The energy objective [14] can be seen as a refinement of the mean-payoff objective. It asks for the accumulated reward at any point of a run not to be lower than some finite threshold. As the name suggests, it is useful when reasoning about systems with a finite initial energy level that should never become depleted. Note that the accumulated reward is not bounded a priori, which essentially turns a finite-state game into an infinite-state one.

In this paper we consider SSGs with energy-parity objectives, which require runs to satisfy both an energy and a parity objective. It is natural to consider such an objective for systems that should not only be correct, but also energy efficient. For instance, consider a robot maintaining a nuclear power plant.
We not only require the robot to react correctly to all possible chains of events (parity objective for functional correctness), but also never to run out of energy, as charging it manually would be risky (energy objective).

While the complexity of games with single objectives is often in NP ∩ coNP, asking for multiple objectives often makes solving games harder. Parity games are commonly viewed as the simplest of these objectives, and some traditional solutions for non-stochastic games go through simple reductions, via mean-payoff or energy conditions (which are quite similar in non-stochastic games), to discounted payoff games, which establishes the membership of those problems in UP and coUP [36]. However, asking for two parity objectives to be satisfied at the same time leads to coNP-completeness [21].

We study the almost-sure satisfaction of the energy-parity objective, i.e., with probability 1. Such qualitative analysis is important as there are many applications where we need to know whether the correct behavior arises almost surely, e.g., in the analysis of randomized distributed algorithms (see, e.g., [43,49]) and safety-critical examples like the one from above. Moreover, the algorithms for quantitative analysis, i.e., computing the optimal probability of satisfaction, typically start by performing the qualitative analysis first and then solving a game with a simpler objective (see, e.g., [23,15]). Finally, there are stochastic models for which qualitative analysis is decidable but quantitative analysis is not (e.g., probabilistic finite automata [6]). This may also be the case for our model.

Our contributions.
We consider stochastic games with energy-parity winning conditions and show that deciding whether maximizer can win almost surely for a given initial energy level k is in NP ∩ coNP. We show the same for checking if such a k exists at all and for checking if a given k is the smallest possible for which this holds. The proofs are considerably harder than the corresponding result for MDPs [41] (on which they are partly based), because the attainable mean-payoff value is no longer a valid criterion in the analysis (via combinations of sub-objectives). E.g., even though the stored energy might be inexorably drifting towards +∞ (resp. −∞), the mean-payoff value might still be zero, because the minimizer (resp. maximizer) can delay payoffs for longer and longer (though not indefinitely, due to the parity condition). Moreover, the minimizer might be able to choose between different ways of losing and never commit to any particular way after any finite prefix of the play (see Example 1).

Our proof characterizes almost-sure energy-parity via a recursive combination of complex sub-objectives called Gain and Bailout, which can each eventually be solved in NP ∩ coNP.

Our proof of the coNP membership is based on a result on the strategy complexity of a natural class of objectives, which is of independent interest. We show (cf. Theorem 6; based on previous work in [35]) that, if an objective O is such that its complement is both shift-invariant and submixing, and every MDP admits optimal finite-memory deterministic maximizer strategies for O, then the same is true in turn-based stochastic games.

Example 1.
Figure 1 shows an energy-parity game that the maximizer can win almost surely when starting with an energy level of ≥ 2. If the energy level is ≥ 3, then the maximizer can turn left and has at least a 1/2 chance that the energy level will never drop to 2 while winning the game with priority 2. This is because we can view this process as a random walk on a half line.

Fig. 1: An SSG with two maximizer states (□), one minimizer state (◇) and one probabilistic state (○). Each state is annotated with its priority and each edge with a reward by which it increases the energy level (respectively, decreases it if the reward is negative). The maximizer wins if the lowest priority visited infinitely often is even and the energy level never drops below 0.

If x_n is the probability of reaching energy level 2 when starting at energy level n, then these probabilities are the least point-wise positive solution of the following system of linear equations: x_2 = 1 and x_n = (2/3)·x_{n+1} + (1/3)·x_{n−1} for all n ≥ 3. We then get that x_n = (1/2)^{n−2}, so the probability of not reaching energy 2 is ≥ 1/2 for all n ≥ 3. Always turning left guarantees that, almost surely, the parity condition holds and the limit inferior of the energy level is not −∞. We call this condition Gain. Strategies for
Gain can be used when the energy level is sufficiently high (at least 3 in our example) to win with a positive probability.

However, if maximizer plays for Gain and always moves left, then for every initial energy level the chance of eventually dropping the energy down to level 2 is positive, due to the negative cycle. When that happens, the only other option for the maximizer is to move right. There, minimizer can 'choose how to lose', via a disjunction of two conditions that we later formalize as Bailout. Either minimizer goes back to the start state without changing the energy level (thus maximizer wins, as the energy stays at level 2 and only the good priority 2 is seen), or minimizer turns right. In the latter case, the play visits a dominating odd priority (which is bad for maximizer) but also increases the energy by 1, which allows maximizer to switch back to playing left for the Gain condition until energy level 2 is reached again.

Our maximizer strategies are a complex interplay between Bailout and Gain. In the example, it is easy to see that the probability of seeing priority 1 infinitely often is zero if maximizer follows the just-described strategy (the probability of requiring to go right more than n times is at most (1/2)^n), so maximizer wins this energy-parity game almost surely. Note that maximizer does not win almost surely when the initial energy level is 0 or 1.

Previous work on combined objectives.
Non-stochastic energy-parity games have been studied in [16]. They can be solved in NP ∩ coNP, and maximizer strategies require only finite (but exponential) memory, a property that also allowed showing P-time inter-reducibility with mean-payoff parity games. More recently, they were also shown to be solvable in pseudo-quasi-polynomial time [26]. Related results on non-stochastic games (e.g., mean-payoff parity) are summarized in [18].

Most existing work on combined objectives for stochastic systems [17,18,9,41] is restricted to Markov decision processes (MDPs; aka 1½-player games). Almost-sure energy-parity objectives for MDPs were first considered in [17,18], where a direct reduction to ordinary energy games was proposed. This reduction relies on the assumption that maximizer can win using finite memory if he can win at all. Unfortunately, this assumption does not necessarily hold: it was shown in [41] that an almost-sure winning strategy for energy-parity in finite MDPs may require infinite memory. Nevertheless, it was possible to recover the original result, that deciding the existence of a.s. winning strategies is in NP ∩ coNP (and pseudo-polynomial time), by showing that the existence of an a.s. winning strategy can be witnessed by the existence of two compatible, finite-memory, winning strategies for two simpler objectives. We generalize this approach from MDPs to full stochastic games.

Stochastic mean-payoff parity games were studied in [20], where it was shown that they can be solved in NP ∩ coNP. However, this does not imply a solution for stochastic energy-parity games, since, unlike in the non-stochastic case [16], there is no known reduction from energy-parity to mean-payoff parity in stochastic games. (The reduction in [16] relies on the fact that maximizer has a winning finite-memory strategy for energy-parity, which does not generally hold for stochastic games or MDPs; see above.)

A related model are the 1-counter MDPs (and stochastic games) studied in [12,11,8], since the value of the counter can be interpreted as the stored energy. These papers consider the objective of reaching counter value zero (which is dual to the energy objective of staying above zero), thus the roles of minimizer and maximizer are swapped. However, unlike in this paper, these works do not combine termination objectives with extra parity conditions.

Structure of the paper.
The rest of the paper is organized as follows. We start by introducing the notation and formal definitions of games and objectives in the next section. In Section 3 we show how checking almost-sure energy-parity objectives can be characterized in terms of two newly defined auxiliary objectives: Gain and Bailout. In Sections 4 and 5, we show that almost-sure Bailout and Gain objectives, respectively, can be checked in NP and coNP. Section 6 contains our main result: NP and coNP algorithms for checking almost-sure energy-parity games with a known and unknown initial energy, as well as checking if a given initial energy is the minimal one. We conclude and point out some open problems in Section 7. Due to page restrictions, most proofs in the main body of the paper were replaced by sketches. The detailed proofs can be found in the appendix.

A probability distribution over a set X is a function f : X → [0, 1] such that Σ_{x ∈ X} f(x) = 1. We write D(X) for the set of distributions over X.

Games, Strategies, Measures. A Simple Stochastic Game (SSG) is a directed graph G def= (V, E, λ), where all states have an outgoing edge and the set of states is partitioned into states owned by maximizer (V_□), minimizer (V_◇) and probabilistic states (V_○). The set of edges is E ⊆ V × V, and λ : V_○ → D(E) assigns each probabilistic state a probability distribution over its outgoing edges. W.l.o.g., we assume that each probabilistic state has at most two successors, because one can introduce a new probabilistic state for each excess successor. We let λ(ws) def= λ(s) for all ws ∈ (V E)* V_○.

A path is a finite or infinite sequence ρ def= s_0 e_0 s_1 e_1 ... such that e_i = (s_i, s_{i+1}) ∈ E holds for all indices i. A run is an infinite path and we write Runs def= (V E)^ω for the set of all runs.

A strategy for maximizer is a function σ : (V E)* V_□ → D(E) that assigns to each path ws ∈ (V E)* V_□ a probability distribution over the outgoing edges of its target node s. That is, σ(ws)(e) > 0 implies e = (s, t) ∈ E for some t ∈ V. A strategy is called memoryless if σ(xs) = σ(ys) for all x, y ∈ (V E)* and s ∈ V_□, deterministic if σ(w) is Dirac for all w ∈ (V E)* V_□, and finite-state if there exists an equivalence relation ∼ on (V E)* V_□ with finite index such that σ(ρ_1) = σ(ρ_2) whenever ρ_1 ∼ ρ_2. Of particular interest to us will be the class of memoryless deterministic strategies (MD) and the class of finite-memory deterministic strategies (FD). Strategies for minimizer are defined analogously and will usually be denoted by τ : (V E)* V_◇ → D(E).

A maximizing (minimizing) Markov Decision Process (MDP) is a game in which minimizer (maximizer) has no choices, i.e., all her states have exactly one successor. We will write G[τ] for the MDP resulting from fixing the strategy τ. A Markov chain is a game where neither player has a choice. In particular, G[σ, τ] is a Markov chain obtained by setting, in the game G, the strategies for maximizer and minimizer to σ and τ, respectively.

Given an initial state s ∈ V and strategies σ and τ for maximizer and minimizer, respectively, the set of runs starting in s naturally extends to a probability space as follows. We write Runs^G_w for the w-cylinder, i.e., the set of all runs with prefix w ∈ (V E)* V. We let F^G be the σ-algebra generated by all these cylinders. We inductively define a probability function P^{G,σ,τ}_s on all cylinders, which then uniquely extends to F^G by Carathéodory's extension theorem [5], by setting P^{G,σ,τ}_s(Runs^G_s) def= 1 and P^{G,σ,τ}_s(Runs^G_w) def= Π_{i=0}^{n−1} dist_i(s_0 e_0 s_1 ... s_i)(e_i) for w = s_0 e_0 s_1 e_1 ... e_{n−1} s_n, where s_0 = s, e_i = (s_i, s_{i+1}) and dist_i is σ(·), τ(·) or λ(·), for s_i ∈ V_□, V_◇ or V_○, respectively.

Objective Functions.
A (Borel) objective is a set Obj ∈ F^G of runs. We write ¬Obj def= Runs \ Obj for its complement. Borel objectives Obj are weakly determined [40,39], which means that

sup_σ inf_τ P^{σ,τ}_s(Obj) = inf_τ sup_σ P^{σ,τ}_s(Obj).

This quantity is called the value of Obj in state s, and written as Val^G_s(Obj). We say that Obj holds almost-surely (abbreviated as a.s.) at state s iff there exists σ such that ∀τ. P^{G,σ,τ}_s(Obj) = 1. Let AS^G(Obj) denote the set of states at which Obj holds almost surely. We will drop the superscript G and simply write Runs, P^{σ,τ}_s and AS(Obj) if the game is clear from the context.

We use the syntax and semantics of the operators F (eventually) and G (always) from the temporal logic LTL [25] to specify some conditions on runs. A reachability condition is defined by a set of target states T ⊆ V. A run ρ = s_0 e_0 s_1 ... satisfies the reachability condition iff there exists an i ∈ N s.t. s_i ∈ T. We write F T ⊆ Runs for the set of runs that satisfy this reachability condition. Given a set of states W ⊆ V, we lift this to a safety condition on runs and write G W ⊆ Runs for the set of runs ρ = s_0 e_0 s_1 ... where ∀i. s_i ∈ W.

A parity condition is given by a bounded function parity : V → N that assigns a priority (a non-negative integer) to each state. A run ρ ∈ Runs satisfies the parity condition iff the minimal priority that appears infinitely often on the run is even. The parity objective is the subset PAR ⊆ Runs of runs that satisfy the parity condition.

Energy conditions are given by a function r : E → Z that assigns a reward value to each edge. For a given initial energy value k ∈ N, a run s_0 e_0 s_1 e_1 ... satisfies the k-energy condition if, for every finite prefix of length n, the energy level k + Σ_{i=0}^{n} r(e_i) is greater or equal to 0. Let EN(k) ⊆ Runs denote the k-energy objective, consisting of those runs that satisfy the k-energy condition. The l-storage condition holds for a run s_0 e_0 s_1 e_1 ... if l + Σ_{i=m}^{n−1} r(s_i, s_{i+1}) ≥ 0 holds for every infix s_m e_m s_{m+1} ... s_n. Let ST(k, l) ⊆ Runs denote the k-energy l-storage objective, consisting of those runs that satisfy both the k-energy and the l-storage condition. We write ST(k) for ∪_l ST(k, l). Clearly, ST(k) ⊆ EN(k).

Mean-payoff and limit-payoff conditions are defined w.r.t. the same reward function as the energy conditions. The mean-payoff value of a run ρ = s_0 e_0 s_1 e_1 ... is MP(ρ) def= lim inf_{n→∞} (1/n) Σ_{i=0}^{n−1} r(e_i). For △ ∈ {>, ≥, =, ≤, <} and c ∈ R ∪ {−∞, ∞}, the set MP(△ c) ⊆ Runs consists of all runs ρ with MP(ρ) △ c. Let LimInf(△ c) ⊆ Runs contain all runs ρ with (lim inf_{n→∞} Σ_{i=0}^{n} r(e_i)) △ c, and likewise for LimSup(△ c).

The combined energy-parity objective EN(k) ∩ PAR is Borel and therefore weakly determined, meaning that it has a well-defined (inf sup = sup inf) value for every game [40,39]. Moreover, the almost-sure energy-parity objective (asking to win with probability 1) is even strongly determined [38]: either maximizer has a strategy to enforce the condition with probability 1, or minimizer has a strategy to prevent this.
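The finite-prefix conditions above are straightforward to check mechanically. A minimal sketch (Python; the encoding of a prefix as its list of edge rewards is ours, for illustration only):

```python
def energy_ok(k, rewards):
    """k-energy condition on a finite prefix: every partial sum
    k + r(e_0) + ... + r(e_n) stays >= 0."""
    level = k
    for r in rewards:
        level += r
        if level < 0:
            return False
    return True

def storage_ok(l, rewards):
    """l-storage condition on a finite prefix: l plus the reward sum of
    every infix is >= 0, i.e. no infix drops the energy by more than l."""
    n = len(rewards)
    return all(l + sum(rewards[m:j]) >= 0
               for m in range(n) for j in range(m, n + 1))

rewards = [1, -1, -1, 1, 1]
print(energy_ok(2, rewards))   # True: levels 3, 2, 1, 2, 3
print(energy_ok(0, rewards))   # False: the level drops to -1
print(storage_ok(2, rewards))  # True: the worst infix sums to -2
print(storage_ok(1, rewards))  # False
```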
The main theorem of this section (Theorem 5) characterizes almost-sure energy-parity objectives in terms of two intermediate objectives called Gain and k-Bailout for parameters k ≥ 0. This will form the basis of all computability results: we will show (as Theorems 14, 17 and 18) how to compute almost-sure sets for these intermediate objectives.
Definition 2.
Consider a finite SSG G = (V, E, λ), as well as reward and parity functions defining the objectives PAR, LimInf(> −∞), LimSup(= ∞) as well as ST(k, l) and EN(k) for every k, l ∈ N. We define the combined objectives Gain and k-Bailout def= ∪_l Bailout(k, l), where

Gain def= LimInf(> −∞) ∩ PAR
Bailout(k, l) def= (ST(k, l) ∩ PAR) ∪ (EN(k) ∩ LimSup(= ∞)).

The main idea behind these two objectives is a special witness property for energy-parity. We argue that, if maximizer has an almost-sure winning strategy for energy-parity, then he also has one that combines two almost-sure winning strategies, one for Gain and one for k-Bailout.

Notice that playing an almost-sure winning strategy for Gain implies a uniformly lower-bounded, strictly positive chance that the energy level never drops below zero (assuming it is sufficiently high to begin with). This fact uses the finiteness of the set of control-states and does not hold for infinite-state MDPs. In the unlikely event that the energy level does get close to zero, maximizer switches to playing an almost-sure winning strategy for k-Bailout. This is a disjunction of two scenarios, and the balance might be influenced by minimizer's choices. In the first scenario (ST(k, l) ∩ PAR), the energy never drops much and stays above zero (thus satisfying energy-parity). In the second scenario (EN(k) ∩ LimSup(= ∞)), the parity objective is temporarily suspended in favor of boosting the energy (while always staying above zero) to a sufficiently high level to switch back to the strategy for Gain and thus try again from the beginning. The probability of infinitely often switching between these modes is zero, due to the lower-bounded chance of success in the Gain phase. Therefore, maximizer eventually wins by playing for Gain. Note that maximizer needs to remember the current energy level in order to know when to switch; consequently, this strategy uses infinite memory.
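The "zero probability of infinitely many switches" step is the standard geometric-trials argument: each Gain phase succeeds forever with probability at least 1/2, so the number of Bailout phases is dominated by a geometric distribution. A minimal Monte Carlo sketch of just this counting argument (Python; the 1/2 success bound is taken from the construction above, everything else is illustrative):

```python
import random

def num_bailouts(rng, p_success=0.5):
    """Number of failed Gain phases (each followed by a Bailout phase)
    before some Gain phase succeeds forever; success prob. >= p_success."""
    n = 0
    while rng.random() >= p_success:
        n += 1
    return n

rng = random.Random(42)
samples = [num_bailouts(rng) for _ in range(100_000)]
# P(more than n switches) <= (1/2)^n, so infinitely many occur with prob. 0.
frac_more_than_5 = sum(s > 5 for s in samples) / len(samples)
print(round(frac_more_than_5, 3))  # close to (1/2)^6 = 0.015625
```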
Example 3.
Consider again the game in Fig. 1. The middle left state satisfies both the Gain and the k-Bailout objectives for all k ≥ 2: maximizer can always go left for Gain, or always go right for k-Bailout, when at that state. Note that it satisfies neither the 0-Bailout nor the 1-Bailout objective.

We define the subset W ⊆ V of states from which maximizer can almost surely win both Gain and k-Bailout (assuming sufficiently high initial energy), while at the same time ensuring that the play remains within this set of states. These are the states from which maximizer can win by freely combining individual strategies for the Gain and Bailout objectives.
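The set W introduced next (Definition 4) is a greatest fixed point, computable by the usual downward iteration given decision procedures for the two almost-sure sets. A minimal sketch (Python) with hypothetical, stubbed oracles standing in for AS(Gain ∩ G W) and ∪_k AS(k-Bailout ∩ G W):

```python
def greatest_fixed_point(states, as_gain_within, as_bailout_within):
    """Downward iteration for the largest W with
    W ⊆ AS(Gain ∩ G W) ∩ ∪_k AS(k-Bailout ∩ G W).
    Both oracle arguments must be monotone set functions."""
    w = set(states)
    while True:
        new_w = w & as_gain_within(w) & as_bailout_within(w)
        if new_w == w:
            return w
        w = new_w

# Toy monotone stand-ins on states {0,...,4}: state 0 is self-sustaining,
# any other state needs its successor s+1 to stay in the candidate set.
gain = lambda w: {s for s in w if s == 0 or s + 1 in w}
bailout = lambda w: set(w)
print(greatest_fixed_point(range(5), gain, bailout))  # {0}
```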
Definition 4.
Given a finite SSG G = (V, E, λ), let W ⊆ V be the largest subset of states satisfying the following condition:

W ⊆ AS(Gain ∩ G W) ∩ ∪_k AS(k-Bailout ∩ G W)

This condition describes a fixed-point, and it is easy to see that if two sets W_1 and W_2 are such fixed-points, then so is W_1 ∪ W_2. Thus, the maximal fixed-point W is well-defined.

Our main characterization of almost-sure energy-parity objectives is the following Theorem 5. It states that maximizer can almost surely win an EN(k) ∩ PAR objective if, and only if, he can win the easier k-Bailout objective while always staying in the safe set W.

Theorem 5.
For every k ∈ N, AS(EN(k) ∩ PAR) = AS(k-Bailout ∩ G W).

Our proof of this characterization theorem relies on the following claim, which allows us to lift the existence of finite-memory deterministic optimal strategies from MDPs to SSGs. It applies to a fairly general class of objectives and, we believe, is of independent interest.

Recall that ¬Obj def= Runs \ Obj denotes the complement of objective Obj. For runs a, b, c ∈ Runs we say that a is a shuffle of b and c if there exist factorizations b = b_1 b_2 ... and c = c_1 c_2 ... such that a = b_1 c_1 b_2 c_2 .... An objective Obj is called submixing if, for every run a ∈ Obj that is a shuffle of runs b and c, either b ∈ Obj or c ∈ Obj. Obj is shift-invariant if, for every run s_0 e_0 s_1 e_1 ..., it holds that s_0 e_0 s_1 e_1 ... ∈ Obj ⇐⇒ s_1 e_1 s_2 ... ∈ Obj. Shift-invariance slightly generalizes the better-known tail condition (see [35] for a discussion).
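To see the submixing notion in action in a finite setting (our own encoding, not from the paper): for eventually-periodic runs, a blockwise shuffle that alternates blocks of two cycles sees, infinitely often, exactly the priorities of both cycles. On such runs the complement of a parity objective is readily checked to be submixing:

```python
from itertools import product

def fails_parity(cycle):
    """Complement of parity on an eventually-periodic run: the minimal
    priority occurring infinitely often (= min over the cycle) is odd."""
    return min(cycle) % 2 == 1

def shuffle_cycles(b_cycle, c_cycle):
    """Cycle of a blockwise shuffle b1 c1 b2 c2 ...: the shuffled run sees
    the priorities of both cycles infinitely often."""
    return b_cycle + c_cycle

# Submixing: whenever the shuffle violates parity, some component does too,
# because the minimal priority of the shuffle is attained in one component.
for b, c in product([(0,), (1,), (2, 3), (1, 2)], repeat=2):
    if fails_parity(shuffle_cycles(b, c)):
        assert fails_parity(b) or fails_parity(c)
print("submixing check passed")
```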
Theorem 6.
Let O be an objective such that ¬O is both shift-invariant and submixing. If maximizer has optimal FD strategies (from any state s) for O in every finite MDP, then maximizer has optimal FD strategies (from any state s) for O in every finite SSG.

This applies in particular to the Gain objective, but not to the k-Bailout objectives, as these are not shift-invariant. A proof of Theorem 6 can be found in Appendix A. It uses a recursive argument based on the notion of reset strategies from [35].

The remainder of this section is dedicated to proving Theorem 5. We will first collect the remaining technical claims about Gain, Bailout, and reachability objectives. Most notably, as Lemma 8, we show that if maximizer can almost surely win Gain in an SSG, then he can do so using an FD strategy which moreover satisfies an energy-parity objective with strictly positive (and lower-bounded) probability. This is shown in part based on Theorem 6 applied to the Gain objective. We will also need the following fact about reachability objectives in finite MDPs.
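Quantitatively, a tail bound of the shape in the following lemma immediately yields the energy threshold used later in Lemma 8: given c < 1 and h, one can pick the smallest k̂ ≥ h whose residual failure probability is at most 1 − δ. A small sketch (Python; treating the bound shape c^k/(1 − c) as an assumption here, cf. Lemma 7 and [8]):

```python
def min_energy_threshold(delta, c, h):
    """Smallest k >= h with 1 - c**k / (1 - c) >= delta: enough initial
    energy so that, assuming an exponential tail bound of the form
    c**k / (1 - c) (an assumption here, cf. Lemma 7), the energy
    condition holds with probability >= delta."""
    assert 0 <= delta < 1 and 0 < c < 1
    k = h
    while 1 - c**k / (1 - c) < delta:
        k += 1
    return k

print(min_energy_threshold(0.5, 0.5, 0))   # 2, since 1 - 0.5**2/0.5 = 0.5
print(min_energy_threshold(0.99, 0.5, 0))  # first k with 2 * 0.5**k <= 0.01
```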
Lemma 7 ([8, Lemma 3.9]).
Let M be a finite MDP and Reach T be the reachability objective with target T def= {s′ | Val_{s′}(LimInf(= −∞)) = 1}. One can compute a rational constant c < 1 and an integer h ≥ 0 such that for all states s and i ≥ h we have ∀τ. P^τ_s(¬EN(i) ∩ ¬Reach T) ≤ c^i / (1 − c).

Lemma 8.
Consider a finite SSG G = (V, E, λ) where Gain holds a.s. for every state s ∈ V. Then, for every δ ∈ [0, 1) and s ∈ V, there exist a k̂ ∈ N and an FD strategy σ̂ s.t.
1. ∀τ. P^{σ̂,τ}_s(Gain) = 1, and
2. ∀τ. P^{σ̂,τ}_s(EN(k̂) ∩ PAR) ≥ δ.

Proof. Fix a δ ∈ [0, 1) and a state s ∈ V. Both the LimInf(= −∞) and the ¬PAR objective are shift-invariant and submixing, and therefore their union also has both these properties. It follows that ¬Gain = ¬(LimInf(> −∞) ∩ PAR) = LimInf(= −∞) ∪ ¬PAR is both shift-invariant and submixing, since the complement of a parity objective is also a parity objective. By Lemma 16 and Theorem 6, there exists an almost-sure winning FD strategy σ̂ for maximizer for the objective Gain from s, i.e., ∀τ. P^{σ̂,τ}_s(Gain) = 1, thus yielding Item 1.

Let M be the MDP obtained from G by fixing the strategy σ̂ for maximizer from s. Since G is finite and σ̂ is FD, M is also finite. In M we have ∀τ. P^τ_s(Gain) = 1. In particular, in M, the set T def= {s′ | Val_{s′}(LimInf(= −∞)) = 1} is not reachable, i.e., ∀τ. P^τ_s(Reach T) = 0.

By Lemma 7, in M there exist a horizon h ∈ N and a constant c < 1 such that for all i ≥ h we have ∀τ. P^τ_s(¬EN(i) ∩ ¬Reach T) ≤ c^i / (1 − c). Since T cannot be reached in M, the condition ¬Reach T evaluates to true and we have ∀τ. P^τ_s(EN(i)) ≥ 1 − c^i / (1 − c). Since c < 1 and δ < 1, we can pick a sufficiently large k̂ ≥ h such that 1 − c^k̂ / (1 − c) ≥ δ and obtain ∀τ. P^τ_s(EN(k̂)) ≥ δ in M. Moreover, the above property ∀τ. P^τ_s(Gain) = 1 in particular implies ∀τ. P^τ_s(PAR) = 1. Thus we obtain ∀τ. P^τ_s(EN(k̂) ∩ PAR) ≥ δ in M.

Back in the SSG G, we have ∀τ. P^{σ̂,τ}_s(EN(k̂) ∩ PAR) ≥ δ, as required for Item 2.

Lemma 9. EN(k) ∩ PAR ⊆ k-Bailout.

Proof.
Let ρ be a run in EN(k) ∩ PAR. There are two cases. In the first case we have ρ ∈ ∪_l ST(k, l) ∩ PAR and thus directly ρ ∈ k-Bailout. Otherwise, ρ ∉ ∪_l ST(k, l) ∩ PAR. Since ρ ∈ PAR, we must have ρ ∉ ∪_l ST(k, l). Since ρ ∈ EN(k), it follows that ρ does not satisfy the l-storage condition for any l ∈ N. So, for every l ∈ N, there exists an infix ρ′ of ρ s.t. l + r(ρ′) < 0. Let ρ′′ be the prefix of ρ before ρ′. Since ρ ∈ EN(k) we have k + r(ρ′′ρ′) ≥ 0, and thus r(ρ′′) ≥ −k − r(ρ′) > −k + l. To summarize, if ρ ∉ ∪_l ST(k, l) ∩ PAR then, for every l, it has a prefix ρ′′ with r(ρ′′) > −k + l. Thus ρ ∈ LimSup(= ∞), and therefore ρ ∈ k-Bailout.

We now define W′ as the set of states that are almost-sure winning for energy-parity with some sufficiently high initial energy level. (W′ is also called the winning set for the unknown initial credit problem.)

Definition 10. W′ def= ∪_k AS(EN(k) ∩ PAR).

Lemma 11.
1. AS(EN(k) ∩ PAR) ⊆ AS(Gain ∩ G W′)
2. AS(EN(k) ∩ PAR) ⊆ AS(k-Bailout ∩ G W′)

Proof.
Let s ∈ AS(EN(k) ∩ PAR) and σ a strategy that witnesses this property. Except for a null set, all runs ρ = s e_0 s_1 e_1 ... e_{n−1} s_n ... from s induced by σ satisfy EN(k) ∩ PAR.

Let ρ′ = s e_0 s_1 e_1 ... s_m be a finite prefix of ρ. For every n ≥ 0 we have k + Σ_{i=0}^{n−1} r(e_i) ≥ 0, since ρ ∈ EN(k). In particular this holds for all n ≥ m. So, for every n ≥ m, we have k + Σ_{i=0}^{m−1} r(e_i) + Σ_{i=m}^{n−1} r(e_i) ≥ 0. Therefore s_m ∈ AS(EN(k′) ∩ PAR), where k′ = k + Σ_{i=0}^{m−1} r(e_i), as witnessed by playing σ with history s e_0 s_1 e_1 ... s_m from s_m. Thus s_m ∈ ∪_k AS(EN(k) ∩ PAR) = W′, i.e., almost all σ-induced runs ρ satisfy G W′.

Towards Item 1, we have EN(k) ⊆ LimInf(> −∞) and thus EN(k) ∩ PAR ⊆ LimInf(> −∞) ∩ PAR = Gain. Therefore σ witnesses s ∈ AS(Gain ∩ G W′).

Towards Item 2, we have EN(k) ∩ PAR ⊆ k-Bailout by Lemma 9. Thus σ witnesses s ∈ AS(k-Bailout ∩ G W′).

Lemma 12. W′ ⊆ W.

Proof. It suffices to show that W′ satisfies the monotone condition imposed on W (cf. Definition 4), since W is defined as the largest set satisfying this condition. Let s ∈ W′ = ∪_k AS(EN(k) ∩ PAR). Then s ∈ AS(EN(k̂) ∩ PAR) for some fixed k̂. By Lemma 11(1) we have s ∈ AS(Gain ∩ G W′). By Lemma 11(2) we have s ∈ AS(k̂-Bailout ∩ G W′) ⊆ ∪_k AS(k-Bailout ∩ G W′).

Proof of Theorem 5.
Towards the ⊆ inclusion, we have AS(EN(k) ∩ PAR) ⊆ AS(k-Bailout ∩ G W′) ⊆ AS(k-Bailout ∩ G W) by Lemma 11(2) and Lemma 12.

Towards the ⊇ inclusion, let s ∈ AS(k-Bailout ∩ G W) and σ be a strategy that witnesses this. We show that s ∈ AS(EN(k) ∩ PAR). We now consider the modified SSG G′ = (W, E, λ) with the state set restricted to W. In particular, s ∈ W and σ witnesses s ∈ AS(k-Bailout) in G′. We now construct a strategy σ* that witnesses s ∈ AS(EN(k) ∩ PAR) in G′, and thus also in G. The strategy σ* will use infinite memory to keep track of the current energy level of the run. Apart from σ, we require several more strategies as building blocks for the construction of σ*.

First, in G we had ∀s′ ∈ W. s′ ∈ AS(Gain ∩ G W), and thus in G′ we have ∀s′ ∈ W. s′ ∈ AS(Gain). For every s′ ∈ W we instantiate Lemma 8 for G′ with δ = 1/2, obtaining a bound k̂_{s′} and a strategy σ̂_{s′} with
1. ∀τ. P^{σ̂_{s′},τ}_{s′}(Gain) = 1, and
2. ∀τ. P^{σ̂_{s′},τ}_{s′}(EN(k̂_{s′}) ∩ PAR) ≥ 1/2.
Let k_1 def= max{k̂_{s′} | s′ ∈ W}. The strategies σ̂_{s′} are called gain strategies.

Second, by the finiteness of V, there is a minimal number k_2 such that ∪_k AS(k-Bailout ∩ G W) = ∪_{k ≤ k_2} AS(k-Bailout ∩ G W) in G. Therefore, in G′ we have that

W ⊆ ∪_k AS(k-Bailout) = ∪_{k ≤ k_2} AS(k-Bailout) = AS(k_2-Bailout).

Thus in G′ for every s′ ∈ W there exists a strategy σ̃_{s′} with ∀τ. P^{σ̃_{s′},τ}_{s′}(k_2-Bailout) = 1. The strategies σ̃_{s′} are called bailout strategies. Let k′ def= k_1 + k_2 − k + 1. We now define the strategy σ*.

Start:
First σ′ plays like σ from s. Since σ witnesses s ∈ AS(k-Bailout) against every minimizer strategy τ, almost all induced runs ρ = s e_0 s_1 e_1 … satisfy either (A) (⋃_l ST(k, l) ∩ PAR), or (B) (EN(k) ∩ LimSup(= ∞)). Almost all runs ρ of the latter type (B) (and potentially also some runs of type (A)) satisfy EN(k) and ∑_{i=0}^{l} r(e_i) ≥ k′ eventually for some l. If we observe ∑_{i=0}^{l} r(e_i) ≥ k′ for some prefix s e_0 s_1 e_1 … e_l s′ of the run ρ, then our strategy σ′ plays from s′ as described in the Gain part below. Otherwise, if we never observe this condition, then our run ρ is of type (A) and σ′ continues playing like σ. Since property (A) implies (EN(k) ∩ PAR), this is sufficient.
Gain:
In this case we are in the situation where we have reached some state s′ after some finite prefix ρ′ of the run, where r(ρ′) ≥ k′. Our strategy σ′ now plays like the gain strategy σ̂_{s′}, as long as r(ρ′) ≥ k′ − k̂ holds for the current prefix ρ′ of the run. By Item 2, this will satisfy ∀τ. P^{σ̂_{s′},τ}_{s′}(EN(k̂_{s′}) ∩ PAR) ≥ 1/2 and thus also ∀τ. P^{σ̂_{s′},τ}_{s′}(EN(k̂) ∩ PAR) ≥ 1/2. It follows that with probability ≥ 1/2 we keep playing σ̂_{s′} forever and satisfy PAR and always r(ρ′) ≥ k′ − k̂, and thus EN(k), since k + r(ρ′) ≥ k + k′ − k̂ = k_0 + 1 ≥ 0. Otherwise, if at some point r(ρ′) = k′ − k̂ − 1, then k + r(ρ′) = k_0. In this case (which happens with probability < 1/2) we continue playing as described in the Bailout part below.
Bailout:
In this case we are in the situation where we have reached some state s″ ∈ W after some finite prefix ρ′ of the run, where k + r(ρ′) = k_0. Since s″ ∈ W, we can now let our strategy σ′ play like the bailout strategy σ̃_{s″} and obtain ∀τ. P^{σ̃_{s″},τ}_{s″}(k_0-Bailout) = 1. Thus almost all induced runs ρ″ = s″ e_0 s_1 e_1 … from s″ satisfy either (A) (⋃_l ST(k_0, l) ∩ PAR), or (B) (EN(k_0) ∩ LimSup(= ∞)). As long as r(ρ′) < k′ holds for the current prefix ρ′ of the run, we keep playing σ̃_{s″}. Otherwise, if eventually r(ρ′) ≥ k′ holds, then we switch back to playing the Gain strategy above. All the runs that never switch back to playing the Gain strategy must be of type (A) and thus satisfy PAR. Since we have k_0-Bailout ⊆ EN(k_0), it follows that, for every prefix ρ″ of the run from s″ according to σ̃_{s″}, we have k_0 + r(ρ″) ≥ 0. Thus, for every prefix ρ‴ of ρ, we have k + r(ρ‴) = k + r(ρ′) + r(ρ″) = k_0 + r(ρ″) ≥ 0. Therefore, the EN(k) objective is satisfied by all runs.

As shown above, almost all runs induced by σ′ that eventually stop switching between the three modes satisfy EN(k) ∩ PAR. Switching from Gain/Bailout to Start is impossible, but switching from Gain to Bailout and back is possible. However, the set of runs that infinitely often switch between Gain and Bailout is a null-set, because the probability of switching from Gain to Bailout is ≤ 1/2 each time. Hence σ′ witnesses s ∈ AS(EN(k) ∩ PAR).

Remark 13.
It follows from the results above that W′ = W. The ⊆ inclusion holds by Lemma 12. For the reverse inclusion we have

W ⊆ ⋃_k AS(k-Bailout ∩ G W)   (by Definition 4)
  = ⋃_k AS(EN(k) ∩ PAR)       (by Theorem 5)
  = W′                        (by Definition 10).

In this section we will argue that it is possible to decide, in NP and coNP, whether the bailout objective can be satisfied almost surely. More precisely, we show the existence of procedures to decide if, for a given k ∈ ℕ and state s, there exists an l ∈ ℕ such that s almost-surely satisfies the Bailout(k, l) objective

Bailout(k, l) := (ST(k, l) ∩ PAR) ∪ (EN(k) ∩ LimSup(= ∞)).

Recall that the idea behind the Bailout objective is that, during a game for energy-parity, maximizer temporarily abandons the parity (but not the energy) condition in order to increase the energy to a sufficient level (which will then allow him to try an a.s. strategy for Gain once more). However, in a stochastic game – as opposed to an MDP [41] – an opponent could possibly prevent this increase in energy level, at the expense of satisfying the original energy-parity objective in the first place (cf. Example 1). The Bailout objective is designed to capture the disjunction of both outcomes, as both are favorable for the maximizer. The parameter k is the acceptable total energy drop (i.e., the initial value), and the parameter l is the acceptable energy drop on any infix of a play, which translates to the upper bound on the energy level in the second outcome.

The question can be phrased equivalently as membership of a control state s in the almost-sure set for the k-Bailout objective for a given game G and energy level k ∈ ℕ.

Theorem 14.
One can check in NP, coNP, and pseudo-polynomial time if, for a given SSG G := (V, E, λ), k ∈ ℕ and control state s ∈ V, maximizer can almost-surely satisfy k-Bailout from s.

Moreover, there are K, L ∈ ℕ, polynomial in |V| and the largest absolute transition reward, so that ⋃_{k≥0} AS_G(k-Bailout) = AS_G(Bailout(K, L)). And so, checking whether state s belongs to ⋃_{k≥0} AS_G(k-Bailout) is in NP and coNP.

Proof (sketch). This is shown by a sequence of transformations of the game, ultimately reducing to finding the winner of a non-stochastic game with an energy-parity objective, which is known to be solvable in NP, coNP and pseudo-polynomial time [19]. One important observation is that it is possible to replace, without changing the outcome, the energy condition EN(k) in the Bailout(k, l) objective by the more restrictive energy-storage condition ST(k, l). See Appendix B for further details.

In this section we will argue that it is possible to decide, in NP and coNP, whether the Gain objective (i.e., LimInf(> −∞) ∩ PAR) can be satisfied almost surely. We start by investigating the strategy complexity of winning strategies for the Gain objective.
Lemma 15.
In every finite SSG, minimizer has optimal MD strategies for objective Gain.

Proof. We show that maximizer has MD optimal strategies for the complement objective LimInf(= −∞) ∪ ¬PAR. This is equivalent to the claim of the lemma because ¬(LimInf(> −∞) ∩ PAR) = LimInf(= −∞) ∪ ¬PAR, and the complement of a parity condition is itself a parity condition (with all priorities incremented by one).

We note that both LimInf(= −∞) and parity objectives are shift-invariant and submixing, and therefore the union LimInf(= −∞) ∪ ¬PAR also has both these properties. The claim now follows from the fact that SSGs with objectives that are both submixing and shift-invariant admit MD optimal strategies for maximizer [35, Theorem 5.2].

Based on the results in [41] one can show a similar claim for maximizer strategies in MDPs.
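The observation used above, that incrementing every priority by one complements a parity condition, can be checked directly on eventually-cyclic runs. The following is a small illustrative sketch (the run encoding and all names are ours, not from the paper):

```python
def parity_wins(cycle, priority):
    """Parity objective on a run that eventually repeats `cycle` forever:
    the minimal priority occurring infinitely often must be even."""
    return min(priority[s] for s in cycle) % 2 == 0

priorities = {'a': 1, 'b': 2, 'c': 3}
# Incrementing every priority by one yields the complement objective:
shifted = {s: p + 1 for s, p in priorities.items()}
for cycle in (['a'], ['b'], ['a', 'b'], ['b', 'c']):
    assert parity_wins(cycle, shifted) == (not parity_wins(cycle, priorities))
```

The shift preserves the relative order of priorities but flips the even/odd role of each one, which is exactly why the two objectives are complementary.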
Lemma 16.
For finite MDPs, almost-sure winning maximizer strategies for
Gain can be chosen FD.
Using the existence of MD optimal minimizer strategies (Lemma 15) and the coNP upper bound for checking almost-sure Gain in MDPs established in [41], we can derive a coNP procedure. See Appendix C.2 for full details.
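To illustrate the shape of this argument (not the actual procedure, which is in Appendix C.2): by Lemma 15 it suffices for maximizer to win almost surely against every MD minimizer strategy, and each such strategy induces an MDP. A naive enumeration sketch, where `mdp_as_gain` is a hypothetical oracle standing in for the MDP check from [41]:

```python
from itertools import product

def as_gain_against_md(min_states, choices, mdp_as_gain):
    """True iff maximizer wins a.s. against every MD minimizer strategy.
    An MD strategy tau, fixing one successor per minimizer state, is a
    polynomial-size certificate of *non*-membership (hence coNP)."""
    for picks in product(*(choices[s] for s in min_states)):
        tau = dict(zip(min_states, picks))
        if not mdp_as_gain(tau):  # tau refutes almost-sure Gain
            return False
    return True

# toy oracle: maximizer loses a.s. Gain only if minimizer moves 'right' at m1
oracle = lambda tau: tau['m1'] != 'right'
assert as_gain_against_md(['m1'], {'m1': ['left', 'right']}, oracle) is False
assert as_gain_against_md(['m1'], {'m1': ['left']}, oracle) is True
```

The loop is exponential in the number of minimizer states; the coNP bound comes from the fact that a single refuting `tau` can be guessed and then verified, not from this enumeration.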
Theorem 17.
Checking whether a state s ∈ V of an SSG satisfies Gain almost-surely is in coNP.

The rest of this section will deal with the NP upper bound, which is the most challenging part of this paper. The crux of our proof is the observation that if maximizer has a strategy that wins almost surely against all MD minimizer strategies, then he wins almost surely. This is because one of these MD strategies is optimal, due to Lemma 15. We show that, in order to witness such an almost-sure winning strategy for maximizer in an SSG G, it suffices to provide a polynomially larger SSG G_3, together with an almost-sure winning strategy for the storage-parity objective (see Theorem 21 in Section 6) in G_3. This will give us an NP algorithm, because G_3, along with its winning strategy, can be guessed and verified in polynomial time. Formally we claim that:

Theorem 18.
Checking whether a state s ∈ V of G satisfies Gain almost-surely is in NP.

Proof (sketch). For technical convenience, we will assume w.l.o.g. that every SSG henceforth is in a normal form, where every random state has only one predecessor, which is owned by the maximizer. To show the existence of G_3, we are going to introduce two intermediate games: G_1 and G_2. These games are never constructed by our NP algorithm, but are defined only to break down the complex construction of G_3 into more manageable steps.

Intuitively, G_1 is just G where all rewards on edges are multiplied by a large enough factor f to turn any strategy with a positive mean-payoff into one with a suitably large mean-payoff. G_2 is an extension of G_1 where the maximizer is given a choice before every visit to a probabilistic node. He can either let the game proceed as before, or sacrifice part of his one-step reward in exchange for a more evenly balanced reward outcome, so that the energy can no longer drop arbitrarily low when a probabilistic cycle is reached. As a result, in G_2 it suffices to consider a storage-parity objective (see Theorem 21 in Section 6) instead of Gain. The number of choices maximizer is given is the number of MD minimizer strategies, which clearly can be exponential. That would not suffice for an NP algorithm. Therefore, we show that most of these choices are redundant and can be removed without impairing the almost-sure winning region. As the result of that pruning, we obtain G_3 of polynomial size.

For the technical details of the G → G_1 → G_2 → G_3 constructions please see Appendix C.3. Figure 2 shows what these transformations may look like.

In this section, we prove the main results of the paper, namely that almost-sure energy-parity stochastic games can be decided in NP and coNP. The proofs are straightforward and follow from the much more involved characterization of the almost-sure energy-parity objective in terms of the Bailout and Gain objectives established in Section 3, and their computational complexity analysis in Sections 4 and 5, respectively.
Theorem 19.
Given an SSG and an energy level k*, checking if a state s is almost-sure winning for EN(k*) ∩ PAR is in NP ∩ coNP.

Proof. Recall that we can compute the set W from Definition 4 by iterating

W_i := AS(Gain ∩ G W_{i−1}) ∩ ⋃_k AS(k-Bailout ∩ G W_{i−1}),

starting with W_0 := V, until we reach the greatest fixed point W. Note that at step i we need to solve almost-sure Gain and almost-sure ⋃_k AS(k-Bailout), where the states of the game are restricted to W_{i−1}. There can be at most |V| steps, because at least one state is removed in each iteration.

Fig. 2: An example game G (left) and the derived games: (a) the original game G = G_1, (b) the game G_2, (c) the game G_3. The strategy that always loops in the right-most state of G ensures a mean-payoff of 3. As this is the only MD strategy for maximizer that ensures a positive mean-payoff, a factor f = 1 is sufficient here and we have G = G_1. In the derived game G_2 in Fig. 2b there are as many trade-in options for the random state as there are MD minimizer strategies in G (just two in this example). The blue one (top left) corresponds to minimizer going left and the red one (top right) to going up in G. Maximizer almost-surely wins Gain in G iff he almost-surely wins a storage-parity condition (see Theorem 21) in G_3.

It then suffices to check AS(k-Bailout ∩ G W) (i.e., AS(k-Bailout) for the subgame that consists only of the states of the fixed point W) for k = k*.
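The fixed-point iteration in the proof above can be sketched as follows; `as_gain` and `as_bailout_any_k` are hypothetical oracles (standing in for the procedures of Theorems 17/18 and 14) that return the respective almost-sure winning sets of the subgame restricted to a given state set:

```python
def greatest_fixed_point_W(states, as_gain, as_bailout_any_k):
    """Iterate W_i = AS(Gain ∩ G W_{i-1}) ∩ ⋃_k AS(k-Bailout ∩ G W_{i-1}),
    starting from W_0 = V, until the greatest fixed point W is reached."""
    w = frozenset(states)
    while True:
        w_next = frozenset(as_gain(w)) & frozenset(as_bailout_any_k(w))
        if w_next == w:
            return w  # at most |V| iterations: each earlier step removed a state
        w = w_next

# toy oracles: 'c' never wins Gain; 'b' wins Bailout only while 'c' survives
as_gain = lambda w: {s for s in w if s != 'c'}
as_bailout = lambda w: {s for s in w if s != 'b' or 'c' in w}
assert greatest_fixed_point_W({'a', 'b', 'c'}, as_gain, as_bailout) == {'a'}
```

The toy run illustrates why the iteration is needed at all: removing 'c' in the first round invalidates 'b' in the second, so a single pass over the original game would not suffice.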
Note that this step can be skipped if k* ≥ K, the bound from Theorem 14.

Before we discuss how to use NP and coNP procedures to construct these sets and to conduct the final test on the fixed point W, we note that the '∩ G W_{i−1}' does not add anything substantial, as these are simply the same tests and procedures conducted on the subgame that only consists of the states of W_{i−1}.

To obtain an NP procedure for constructing AS(Gain)—or, as remarked above, AS(Gain ∩ G W_{i−1})—we can guess and validate membership for each state s in this set using the NP result from Theorem 18, and we can guess and validate non-membership for each state s not in this set in NP, using the coNP result from Theorem 17. Similarly, we can guess and validate both membership and non-membership in ⋃_k AS(k-Bailout)—and in ⋃_k AS(k-Bailout ∩ G W_{i−1}), by analysing the subgame with only the states in W_{i−1}—by using the NP and coNP results, respectively, from Theorem 14.

Once we can construct these sets, we can also intersect them and check if a fixed point has been reached. (One can, of course, stop when s ∉ W_i.) We can now conduct the final check in NP using Theorem 18.

A coNP algorithm that constructs W can be designed analogously: once W_{i−1} is known, membership and non-membership of a state s in AS(Gain ∩ G W_{i−1}) can be guessed and validated in coNP by Theorem 17 and by Theorem 18, respectively; and membership or non-membership of a state in ⋃_k AS(k-Bailout ∩ G W_{i−1}) can be guessed and validated in coNP using the coNP and NP parts, respectively, of Theorem 14. Once W is constructed, we can conduct the final check in coNP using Theorem 17.

This result, together with the upper bound on the energy needed to win the energy-parity objective, allows us to solve the "unknown initial energy problem" [7], which is to compute the minimal initial energy level required.

Corollary 20.
For any state s, checking if there is a k such that AS(EN(k) ∩ PAR) holds is in NP ∩ coNP. Also, for a given k*, checking if k* is the minimal energy level required to win almost surely is in NP ∩ coNP as well.

Proof. Due to Theorem 14, if there is an energy level k for which AS(EN(k) ∩ PAR) holds, then it also holds for the bound K, whose size is polynomial in the size of the game. We can then simply calculate K and use the NP and coNP algorithms from Theorem 19 for AS(EN(K) ∩ PAR).

As for the second claim, note that checking whether maximizer cannot win EN(k) ∩ PAR almost surely is also in NP and coNP, as the complement of a coNP and an NP set, respectively. Therefore, for an NP/coNP upper bound it suffices to simultaneously guess certificates for almost-surely EN(k*) ∩ PAR and not almost-surely EN(k* − 1) ∩ PAR, and verify them in polynomial time.

Finally, let us mention that the slightly more restrictive storage-parity objectives can also be solved in NP ∩ coNP. These are almost identical to energy-parity except that, in addition, there must exist some bound l ∈ ℕ such that the energy level never drops by more than l during a run. This extra condition ensures that, if the storage-parity objective holds almost-surely, then there must exist a finite-memory winning strategy for maximizer.

Theorem 21.
One can check in NP, coNP, and pseudo-polynomial time if, for a given SSG H := (V, E, λ), k ∈ ℕ and control state s ∈ V, maximizer can almost-surely satisfy ST(k) ∩ PAR from s.

Moreover, there is a bound L ∈ ℕ, polynomial in the number of states and the largest absolute transition reward, so that ST(k) ∩ PAR = ST(k, L) ∩ PAR.

Proof (sketch). This result follows by a simple adaptation of the proofs showing the same computational complexity for the Bailout objective (Section 4). See the end of Appendix B for further details.
Example 22.
In the game in Fig. 1, maximizer cannot ensure the storage-parity condition ST(k) ∩ PAR for any initial energy level k. This is because it would imply the existence of a finite-memory almost-surely winning strategy, which, as we have already argued, cannot exist. More intuitively, to prevent an intermediate energy drop by l units, a winning maximizer strategy for storage-parity would need to stop moving left after observing the negative cycle in the leftmost state l successive times. However, when maximizer moves right, this gives minimizer the chance to visit the rightmost bad state (with dominating odd priority 1). The chance of that happening is (1/2)^l > 0 for any fixed l. Therefore, maximizer would need to move right infinitely often to satisfy storage, and thus loses (against an optimal minimizer strategy that moves to the rightmost state).

We showed that several almost-sure problems for combined energy-parity objectives in simple stochastic games are in NP ∩ coNP. No pseudo-polynomial algorithm is known (just like for stochastic mean-payoff parity games [20]). All these problems subsume (stochastic) parity games, obtained by setting all rewards to 0. Thus the existence of a pseudo-polynomial algorithm would imply that (stochastic and non-stochastic) parity games are in P, which is a long-standing open problem.

It is known that maximizer already needs infinite memory to win a combined energy-parity objective almost-surely in MDPs [41]. Our results do not imply anything about the memory requirements of optimal minimizer strategies in SSGs for this objective. We conjecture that memoryless minimizer strategies suffice. If this conjecture holds (and is proven), it would greatly simplify the coNP upper bound that we established for this problem.

A natural question is whether results on mean-payoff/energy/parity games can be generalized to a setting with multi-dimensional payoffs. Non-stochastic multi-mean-payoff and multi-energy games have been studied in [50,37,1]. To the best of our knowledge, the techniques used there, e.g., upper bounds on the necessary energy levels as in [37], do not generalize to stochastic games (or MDPs).

Multiple mean-payoff objectives in MDPs have been studied in [10,24], but the corresponding multi-energy (resp. multi-energy-parity) objective has extra difficulties due to the 0-boundary condition on the energy. I.e., even on Markov chains, and without any parity condition, it subsumes problems about multi-dimensional random walks.
Some partial results on Markov chains and MDPs have been obtained in [13,2,3], but the decidability of the almost-sure problem for stochastic multi-energy-parity games (and MDPs) remains open.

Acknowledgments
The work of Sven Schewe and Dominik Wojtczak was supported by EPSRC grant EP/P020909/1.

References
1. Abdulla, P., Mayr, R., Sangnier, A., Sproston, J.: Solving parity games on integer vectors. In: International Conference on Concurrency Theory (CONCUR). vol. 8052 (2013)
2. Abdulla, P.A., Ciobanu, R., Mayr, R., Sangnier, A., Sproston, J.: Qualitative analysis of VASS-induced MDPs. In: International Conference on Foundations of Software Science and Computational Structures (FoSSaCS). vol. 9634 (2016)
3. Abdulla, P.A., Henda, N.B., Mayr, R.: Decisive Markov Chains. Logical Methods in Computer Science, Volume 3, Issue 4 (Nov 2007), https://lmcs.episciences.org/867
4. Alur, R., Henzinger, T.A., Kupferman, O.: Alternating-time temporal logic. J. ACM 49(5), 672–713 (2002)
5. Billingsley, P.: Probability and Measure. Wiley (1995), third edition
6. Blondel, V.D., Canterini, V.: Undecidable problems for probabilistic automata of fixed dimension. Theory of Computing Systems 36(3) (2003)
7. Bouyer, P., Fahrenberg, U., Larsen, K.G., Markey, N., Srba, J.: Infinite runs in weighted timed automata with energy constraints. In: International Conference on Formal Modeling and Analysis of Timed Systems (FORMATS). vol. 5215, pp. 33–47 (2008)
8. Brázdil, T., Brožek, V., Etessami, K., Kučera, A.: Approximating the Termination Value of One-Counter MDPs and Stochastic Games. Information and Computation 222, 121–138 (2013)
9. Brázdil, T., Kučera, A., Novotný, P.: Optimizing the expected mean payoff in energy Markov decision processes. In: International Symposium on Automated Technology for Verification and Analysis (ATVA). vol. 9938, pp. 32–49 (2016)
10. Brázdil, T., Brožek, V., Chatterjee, K., Forejt, V., Kučera, A.: Markov decision processes with multiple long-run average objectives. Logical Methods in Computer Science 10 (2014)
11. Brázdil, T., Brožek, V., Etessami, K.: One-Counter Stochastic Games. In: IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). vol. 8, pp. 108–119 (2010)
12. Brázdil, T., Brožek, V., Etessami, K., Kučera, A., Wojtczak, D.: One-counter Markov decision processes. In: ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 863–874 (2010)
13. Brázdil, T., Kiefer, S., Kučera, A., Novotný, P., Katoen, J.P.: Zero-reachability in probabilistic multi-counter automata. In: Proceedings of the Joint Meeting of the Twenty-Third EACSL Annual Conference on Computer Science Logic (CSL) and the Twenty-Ninth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS). pp. 22:1–22:10 (2014)
14. Chakrabarti, A., De Alfaro, L., Henzinger, T.A., Stoelinga, M.: Resource interfaces. In: International Workshop on Embedded Software. pp. 117–133 (2003)
15. Chatterjee, K., De Alfaro, L., Henzinger, T.A.: The complexity of stochastic Rabin and Streett games. In: International Colloquium on Automata, Languages and Programming (ICALP). pp. 878–890 (2005)
16. Chatterjee, K., Doyen, L.: Energy parity games. In: International Colloquium on Automata, Languages and Programming (ICALP). vol. 6199, pp. 599–610 (2010)
17. Chatterjee, K., Doyen, L.: Energy and mean-payoff parity Markov decision processes. In: International Symposium on Mathematical Foundations of Computer Science (MFCS). vol. 6907, pp. 206–218 (2011)
18. Chatterjee, K., Doyen, L.: Games and Markov decision processes with mean-payoff parity and energy parity objectives. In: Mathematical and Engineering Methods in Computer Science (MEMICS). LNCS, vol. 7119, pp. 37–46. Springer (2011)
19. Chatterjee, K., Doyen, L.: Energy parity games. Theoretical Computer Science 458, 49–60 (2012)
20. Chatterjee, K., Doyen, L., Gimbert, H., Oualhadj, Y.: Perfect-information stochastic mean-payoff parity games. In: International Conference on Foundations of Software Science and Computational Structures (FoSSaCS). vol. 8412 (2014)
21. Chatterjee, K., Henzinger, T.A., Piterman, N.: Generalized parity games. In: International Conference on Foundations of Software Science and Computational Structures (FoSSaCS). pp. 153–167 (2007)
22. Chatterjee, K., Jurdziński, M., Henzinger, T.A.: Simple stochastic parity games. In: Computer Science Logic (CSL). vol. 2803, pp. 100–113. Springer (2003)
23. Chatterjee, K., Jurdziński, M., Henzinger, T.A.: Quantitative stochastic parity games. In: ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 121–130. SIAM (2004)
24. Chatterjee, K., Křetínská, Z., Křetínský, J.: Unifying two views on multiple mean-payoff objectives in Markov decision processes. Logical Methods in Computer Science 13(2) (2017), https://doi.org/10.23638/LMCS-13(2:15)2017
25. Clarke, E., Grumberg, O., Peled, D.: Model Checking. MIT Press (Dec 1999)
26. Daviaud, L., Jurdziński, M., Lazić, R.: A pseudo-quasi-polynomial algorithm for mean-payoff parity games. In: Logic in Computer Science (LICS). pp. 325–334 (2018)
27. De Alfaro, L., Henzinger, T.A.: Interface automata. ACM SIGSOFT Software Engineering Notes 26(5), 109–120 (2001)
28. Dill, D.L.: Trace theory for automatic hierarchical verification of speed-independent circuits, vol. 24. MIT Press, Cambridge (1989)
29. Fearnley, J.: Exponential lower bounds for policy iteration. In: International Colloquium on Automata, Languages and Programming (ICALP). pp. 551–562 (2010)
30. Filar, J., Vrieze, K.: Competitive Markov Decision Processes. Springer (1997)
31. Friedmann, O.: An exponential lower bound for the parity game strategy improvement algorithm as we know it. In: Logic in Computer Science (LICS). pp. 145–156 (2009)
32. Friedmann, O., Hansen, T.D., Zwick, U.: Subexponential lower bounds for randomized pivoting rules for the simplex algorithm. In: Symposium on Theory of Computing (STOC). pp. 283–292 (2011)
33. Gillette, D.: Stochastic games with zero stop probabilities. Contributions to the Theory of Games 3, 179–187 (1957)
34. Gimbert, H., Horn, F.: Solving simple stochastic tail games. In: ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 847–862 (2010), http://epubs.siam.org/doi/abs/10.1137/1.9781611973075.69
35. Gimbert, H., Kelmendi, E.: Two-Player Perfect-Information Shift-Invariant Submixing Stochastic Games Are Half-Positional (Jan 2014), working paper or preprint, https://hal.archives-ouvertes.fr/hal-00936371
36. Jurdziński, M.: Deciding the winner in parity games is in UP ∩ co-UP. Information Processing Letters 68(3), 119–124 (1998)
37. Jurdziński, M., Lazić, R., Schmitz, S.: Fixed-dimensional energy games are in pseudo-polynomial time. In: International Colloquium on Automata, Languages and Programming (ICALP). vol. 9135, pp. 260–272 (2015)
38. Kiefer, S., Mayr, R., Shirmohammadi, M., Wojtczak, D.: On strong determinacy of countable stochastic games. Logic in Computer Science (LICS) (2017)
39. Maitra, A., Sudderth, W.: Stochastic games with Borel payoffs. In: Stochastic Games and Applications, pp. 367–373. Kluwer, Dordrecht (2003)
40. Martin, D.A.: The determinacy of Blackwell games. Journal of Symbolic Logic 63(4), 1565–1581 (1998)
41. Mayr, R., Schewe, S., Totzke, P., Wojtczak, D.: MDPs with Energy-Parity Objectives. Logic in Computer Science (LICS) (2017)
42. Pnueli, A., Rosner, R.: On the synthesis of a reactive module. In: Annual Symposium on Principles of Programming Languages (POPL). pp. 179–190 (1989)
43. Pogosyants, A., Segala, R., Lynch, N.: Verification of the randomized consensus algorithm of Aspnes and Herlihy: a case study. Distributed Computing 13(3), 155–186 (2000)
44. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edn. (1994)
45. Rabin, M.O.: Automata on infinite objects and Church's problem, vol. 13. American Mathematical Soc. (1972)
46. Ramadge, P.J., Wonham, W.M.: Supervisory control of a class of discrete event processes. SIAM Journal on Control and Optimization 25(1), 206–230 (1987)
47. Schrijver, A.: Theory of linear and integer programming. John Wiley & Sons (1998)
48. Shapley, L.S.: Stochastic games. Proceedings of the National Academy of Sciences 39(10), 1095–1100 (1953)
49. Stoelinga, M.: Fun with FireWire: A comparative study of formal verification methods applied to the IEEE 1394 root contention protocol. Formal Aspects of Computing 14(3), 328–337 (2003)
50. Velner, Y., Chatterjee, K., Doyen, L., Henzinger, T.A., Rabinovich, A., Raskin, J.F.: The complexity of multi-mean-payoff and multi-energy games. Information and Computation 241, 177–196 (2015)
51. Wilke, T.: Alternating tree automata, parity games, and modal mu-calculus. Bulletin of the Belgian Mathematical Society Simon Stevin 8(2), 359 (2001)

A Lifting Almost-sure Strategies from MDPs to SSGs
This section contains a proof of Theorem 6, which shows how to lift the existence of memoryless deterministic almost-sure winning strategies from MDPs to SSGs.
Definition 23 ([35], Sec. 2.C).
Let G = (V, E, λ) be an SSG and σ a maximizer strategy. For every finite play p = s_0 … s_n we denote by σ[p] the shift of strategy σ by p, defined as the strategy with

σ[p](t_0 e_1 t_1 … e_m t_m) := σ(p e_1 t_1 … e_m t_m) if s_n = t_0, and
σ[p](t_0 e_1 t_1 … e_m t_m) := σ(t_0 e_1 t_1 … e_m t_m) otherwise.

Then σ is said to be ε-subgame-perfect if for every finite play p the strategy σ[p] is ε-optimal.

Definition 24 ([35], Sec. 5.C).
Let G = (V, E, λ) be an SSG with initial state s, let O be an objective that is both shift-invariant and submixing, and let π ∈ V be a state owned by maximizer. We assume w.l.o.g. that π has two successors, left (l) and right (r). Let G_l and G_r be the SSGs resulting from G by removing the edge from π to r and to l, respectively. Moreover, assume that τ_l and τ_r are ε-subgame-perfect strategies for minimizer in G_l and G_r, respectively.

The trigger strategy τ_lr for minimizer in the original game G (starting at s) is defined as follows:
– start by playing according to τ_l;
– play according to τ_r (initially with empty memory) once maximizer moves from π to r for the first time;
– every time maximizer moves from π to l (or to r), the strategy resumes the previous play in G_l (or G_r).
The trigger strategy τ_lr allocates the memory used by τ_l, τ_r, and one extra bit to remember maximizer's last choice at π.

Lemma 25 ([35], Eq. (19),(20),(21)).
Assume the definitions of Definition 24. Then

∀τ... ∀σ. P^{σ,τ_lr}_s(O) ≤ max{Val^{G_l}_s(O), Val^{G_r}_s(O)} + ε.

Theorem 6.
Let O be an objective whose complement O̅ is both shift-invariant and submixing. If maximizer has optimal FD strategies (from any state s) for O in every finite MDP, then maximizer has optimal FD strategies (from any state s) for O in every finite SSG.

Proof. We assume w.l.o.g. that all minimizer's states have at most two successors. The proof is by induction on the number of minimizer's states with choice (two successors) in G. The base case holds by the assumption that maximizer has FD optimal strategies in MDPs.

For the induction step, we will use Definition 24 and Lemma 25, instantiated with O̅ instead of O. Since O̅ is both shift-invariant and submixing, this satisfies the conditions of Definition 24, but (relative to O) the roles of the players minimizer/maximizer are swapped.

Pick some initial state s and a minimizer's state π for O (i.e., a maximizer's state for O̅) and let G_l, G_r be defined as in Definition 24. By induction hypothesis, in both these games G_l and G_r, maximizer has an FD optimal strategy for objective O from s. Call these strategies σ_l and σ_r, respectively. In particular, since σ_l and σ_r are optimal and O and O̅ are shift-invariant, the strategies σ_l and σ_r are subgame-perfect, and thus ε-subgame-perfect for ε = 0. Thus we can instantiate Definition 24 with objective O̅ and reversed roles of players minimizer/maximizer. I.e., we take σ_l for τ_l and σ_r for τ_r, which are subgame-perfect for player minimizer for objective O̅. We obtain the trigger strategy σ_lr for maximizer for O (i.e., the τ_lr for minimizer for O̅ from Definition 24). Since σ_l and σ_r are FD, so is σ_lr.

We now argue that this trigger strategy σ_lr must be optimal. The shift-invariance and submixing conditions on O̅ imply ([35], Theorem 5.2) that minimizer has MD optimal strategies in every SSG with winning condition O. Let τ* be some MD optimal strategy for minimizer in G from s. W.l.o.g. assume that τ*(π) = l (otherwise rename left/right).

We show that σ_lr and τ* are best responses to each other, and thus both are optimal. That is, in order to finish the induction step, we prove that the following two claims hold for the game G:
1. P^{σ_lr,τ*}_s(O) ≥ sup_σ P^{σ,τ*}_s(O), and
2. P^{σ_lr,τ*}_s(O) ≤ inf_τ P^{σ_lr,τ}_s(O).
Together these imply the claim that σ_lr is optimal, and hence the induction step, because

Val_s(O) = sup_σ inf_τ P^{σ,τ}_s(O) = sup_σ P^{σ,τ*}_s(O) = P^{σ_lr,τ*}_s(O) = inf_τ P^{σ_lr,τ}_s(O),

where the second equality uses the optimality of τ*, the third follows from Item 1, and the fourth from Item 2. It remains to prove the two claims above.

Item 1). Since τ*(π) = l we have P^{σ_lr,τ*}_{G,s}(O) = P^{σ_l,τ*}_{G_l,s}(O) ≥ sup_σ P^{σ,τ*}_{G_l,s}(O) = sup_σ P^{σ,τ*}_{G,s}(O), where the equalities hold by τ*(π) = l and the inequality holds by the assumed optimality of σ_l in G_l.

Item 2). From Lemma 25, instantiated with O̅, we obtain that ∀τ. P^{σ_lr,τ}_s(O̅) ≤ max{Val^{G_l}_s(O̅), Val^{G_r}_s(O̅)} + ε. Since in our case ε = 0 we obtain

∀τ. 1 − P^{σ_lr,τ}_s(O) ≤ max{1 − Val^{G_l}_s(O), 1 − Val^{G_r}_s(O)} = 1 − min{Val^{G_l}_s(O), Val^{G_r}_s(O)}

and thus ∀τ. P^{σ_lr,τ}_s(O) ≥ min{Val^{G_l}_s(O), Val^{G_r}_s(O)}. In particular, for τ = τ* we obtain P^{σ_lr,τ*}_s(O) ≥ min{Val^{G_l}_s(O), Val^{G_r}_s(O)}. However, since τ* is an MD optimal strategy for minimizer, we also have P^{σ_lr,τ*}_s(O) ≤ min{Val^{G_l}_s(O), Val^{G_r}_s(O)}. By combining the above we get

∀τ. P^{σ_lr,τ}_s(O) ≥ min{Val^{G_l}_s(O), Val^{G_r}_s(O)} = P^{σ_lr,τ*}_s(O)

and thus inf_τ P^{σ_lr,τ}_s(O) ≥ P^{σ_lr,τ*}_s(O). This concludes the proof of Item 2 and thus the induction step.

B Bailout
We will proceed in several reduction steps, ultimately reducing to checking the winner of a non-stochastic game with an energy-parity objective.

Assume from now on a fixed SSG G with associated reward and parity functions.

Lemma 26.
Let

Bailout′(k, l) def= (ST(k, l) ∩ PAR) ∪ (ST(k, l) ∩ LimSup(=∞)).

There exists L ∈ ℕ so that AS(⋃_l Bailout(k, l)) = AS(Bailout′(k, L)).

Proof. Pick L larger than |V| · R · c, the number of control states in the game times the largest absolute reward R times the largest priority c used in the parity condition.

We claim that every a.s. winning strategy can be turned into one that avoids sub-runs of the form s −π₀→ s −π₁→ s where 1) both π₀ and π₁ have a strictly negative total effect on the energy level, 2) neither π₀ nor π₁ visits state s internally, and 3) the dominant priority on π₀ and π₁ is the same. If a strategy allows such a path, then one can safely "cut out" π₁ and the resulting strategy will still be a.s. winning. Taken to the limit, such transformations result in a strategy that witnesses membership in AS(Bailout′(k, L)).

Lemma 27.
Let

Bailout′′(k, l) def= (ST(k, l) ∩ PAR) ∪ (ST(k, l) ∩ LimInf(=∞)).

For every k, l ∈ ℕ it holds that Bailout′(k, l) = Bailout′′(k, l).

Proof. Just notice that a run ρ = s₀e₀s₁e₁... ∈ ST(k, l) ∩ LimSup(=∞) must also satisfy the LimInf(=∞) condition because (liminf_{n→∞} Σ_{i=0}^n r(e_i)) ≥ (limsup_{n→∞} Σ_{i=0}^n r(e_i)) − l, by the l-storage assumption.

The idea of the next step is to allow maximizer to witness the LimInf(=∞) condition by occasionally trading in energy for a good priority, thereby satisfying a parity condition instead. This results in a stochastic game with a ST(k, l) ∩ PAR objective.

Let G′ be the SSG derived from G, where maximizer can always trade an energy increase for visiting the best possible priority 0. That is, G′ results from G by replacing every edge s −+a→ t, with a > 0, by the gadget below, where s′ ∈ V_□, parity(s′) = parity(s), and parity(t′) = 0.

[Figure: the gadget replacing an edge s −+a→ t. From s, play moves to the new maximizer state s′, which either takes the reward a directly to t, or first visits the new priority-0 state t′ (forgoing the reward) before continuing to t.]

Lemma 28.
For every state s of G and every k, l ∈ ℕ it holds that s ∈ AS_G(Bailout′′(k, l)) if, and only if, s ∈ AS_{G′}(ST(k, l) ∩ PAR).

Proof.
Assume that R ∈ ℕ is the largest absolute transition reward in G (and hence also G′). Every a.s. winning strategy σ for Bailout′′(k, l) = ST(k, l) ∩ (PAR ∪ LimInf(=∞)) in G can be turned into an a.s. winning strategy σ′ for ST(k, l) ∩ PAR in G′ as follows. The new strategy σ′ behaves just as σ but additionally keeps track of the energy level, up to the bound l · R. If in G, σ chooses to increase the energy level above this bound, σ′ will opt to visit a good priority instead, and continue from the current energy level. Since σ ensures the l-storage condition on (almost) all runs, so does σ′. Moreover, plays in G that do not satisfy PAR must instead satisfy LimInf(=∞). The corresponding runs in G′ according to σ′ will therefore infinitely often visit the best priority and hence satisfy the parity condition.

For the other direction, notice that one can just as well transform an a.s. winning strategy σ′ for storage-parity in G′ into a winning strategy σ for Bailout′′(k, l) in G. The strategy σ simply gains the energy increment a whenever σ′ would visit a newly introduced priority-0 state. Suppose ρ is a play in G that corresponds to a play ρ′ in G′. If ρ′ visits new states only finitely often, then after some finite prefix, the sequences of states visited by ρ and ρ′ are the same. Since ρ′ satisfies the parity condition, so must ρ. Otherwise, if ρ′ visits new states infinitely often, then the difference of the energy levels on ρ and ρ′ must grow unboundedly. Since ρ′ satisfies the l-storage condition, this means that ρ satisfies the LimInf(=∞) condition, and hence Bailout′′(k, l).

Finally, we use a construction similar to that in [23] for parity objectives, to replace random states by small "negotiation gadgets", resulting in a non-stochastic energy-parity game. Let G′′ be the non-stochastic game derived from G′, where random states are replaced by gadgets as in [23].

Lemma 29.
For every state s of G′ and every k, l ∈ ℕ it holds that s ∈ AS_{G′}(ST(k, l) ∩ PAR) if, and only if, s ∈ AS_{G′′}(ST(k, l) ∩ PAR).

Proof. The construction in [23] does not affect the transition rewards. Thus the ST(k, l) condition is trivially preserved. The a.s. PAR condition is preserved by exactly the same argument as in [23].
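To make the reductions above concrete, the positive-edge gadget of the G → G′ step (Lemma 28) can be sketched as a plain graph transformation. The encoding below (a game as a dictionary of reward-labelled edge lists plus a priority map, the generated state names, and the exact placement of the reward a inside the gadget) is a hypothetical choice for this sketch, not the paper's formal definition.

```python
# Sketch of the G -> G' gadget of Lemma 28 (hypothetical encoding):
# every edge s --a--> t with a > 0 is replaced by s -> s' and then either
# s' --a--> t directly, or s' -> t' -> t via the new priority-0 state t',
# forgoing the reward; parity(s') = parity(s) and parity(t') = 0.

def add_trade_gadget(edges, parity):
    new_edges, new_parity = {}, dict(parity)
    fresh = 0
    for s, succs in edges.items():
        out = []
        for (t, a) in succs:
            if a > 0:
                sp, tp = f"{s}_mid{fresh}", f"{s}_zero{fresh}"
                fresh += 1
                new_parity[sp] = parity[s]   # s' copies the priority of s
                new_parity[tp] = 0           # t' carries the best priority
                out.append((sp, 0))
                new_edges[sp] = [(t, a),     # take the reward a directly ...
                                 (tp, 0)]    # ... or trade it for priority 0
                new_edges[tp] = [(t, 0)]
            else:
                out.append((t, a))
        new_edges[s] = out
    return new_edges, new_parity
```

Choosing the detour through the priority-0 state forgoes the reward a, matching the intuition that maximizer trades an energy increase for a good priority.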
Theorem 14.
One can check in NP, coNP, and pseudo-polynomial time whether, for a given SSG G def= (V, E, λ), k ∈ ℕ, and control state s ∈ V, maximizer can almost-surely satisfy k-Bailout from s.

Moreover, there are K, L ∈ ℕ, polynomial in |V| and the largest absolute transition reward, so that ⋃_{k≥0} AS_G(k-Bailout) = AS_G(Bailout(K, L)). And so, checking whether state s belongs to ⋃_{k≥0} AS_G(k-Bailout) is in NP and coNP.

Proof.
By Lemmas 26 to 29, for every k ∈ ℕ it holds that

AS_G(⋃_l Bailout(k, l)) = AS_G(Bailout′(k, L))   (by Lemma 26)
  = AS_G(Bailout′′(k, L))   (by Lemma 27)
  = AS_{G′}(ST(k, L) ∩ PAR)   (by Lemma 28)
  = AS_{G′′}(ST(k, L) ∩ PAR).   (by Lemma 29)

Since G′′ is a two-player non-stochastic game, the claim now follows from [19] (Theorem 2 and Lemma 5). For the existence of the polynomially bounded numbers K, L just notice that G′′ has the same largest absolute transition reward, and only a polynomially larger set of states, compared to G. For non-stochastic energy-parity games such as G′′ it holds that ⋃_{k≥0} AS(ST(k, L) ∩ PAR) = AS(⋃_{k≥0}(ST(k, k) ∩ PAR)) = AS(EN(K) ∩ PAR), if K denotes the product of the number of states, the largest priority, and the largest absolute transition reward in G′′.

Now, to check if a state s belongs to ⋃_{k≥0} AS_G(k-Bailout), we can calculate K and then simply follow the NP or coNP procedure to check if s belongs to AS_G(K-Bailout) instead. This shows that this problem is in NP and coNP as well.

As a side result, note that neither Lemma 29 nor the complexity argument in Theorem 14 makes use of the structure of G′: they hold for all SSGs with a storage-parity condition.

Theorem 21.
One can check in NP, coNP, and pseudo-polynomial time whether, for a given SSG H def= (V, E, λ), k ∈ ℕ, and control state s ∈ V, maximizer can almost-surely satisfy ST(k) ∩ PAR from s.

Moreover, there is a bound L ∈ ℕ, polynomial in the number of states and the largest absolute transition reward, so that AS(ST(k) ∩ PAR) = AS(ST(k, L) ∩ PAR).

C Gain
C.1 Strategy Complexity for Gain
We prove Lemma 16, i.e., if maximizer can almost-surely win Gain in an MDP, then he can do so using a finite-memory deterministic strategy.

To do this, we utilize some results from [41], where we showed how to compute winning regions for energy-parity objectives in MDPs based on a similar combination of "gain" and "bailout" objectives as in this paper.

Consider a state s of a finite MDP with an energy-parity objective and define the limit value of state s as LVal_s def= sup_k Val_s(EN(k) ∩ PAR). This is well defined, because energy conditions are monotone increasing in the initial energy level k.

Lemma 30.
For any state s of a finite MDP, we have Val_s(Gain) = LVal_s.

Proof. It follows directly from the definitions that for every k ∈ ℕ

EN(k) ∩ PAR ⊆ LimInf(≥ −k) ∩ PAR ⊆ LimInf(> −∞) ∩ PAR = Gain

and thus

⋃_k (EN(k) ∩ PAR) ⊆ Gain.  (2)

Towards the reverse inclusion, consider a run ρ ∈ LimInf(≥ −j) ∩ PAR for some j ∈ ℕ. Then, except in a finite prefix ρ′, the energy along ρ stays above −j. Let k′ be the minimal energy reached in ρ′, which is finite because ρ′ is finite, and let k def= −min(k′, −j). Then ρ ∈ EN(k) ∩ PAR ⊆ ⋃_k (EN(k) ∩ PAR). So for every j ∈ ℕ we have

LimInf(≥ −j) ∩ PAR ⊆ ⋃_k (EN(k) ∩ PAR)

and thus

Gain = ⋃_j (LimInf(≥ −j) ∩ PAR) ⊆ ⋃_k (EN(k) ∩ PAR).  (3)

From Eq. (2) and Eq. (3) we obtain

Gain = ⋃_k (EN(k) ∩ PAR).  (4)

Therefore,

Val_s(Gain) = Val_s(⋃_k (EN(k) ∩ PAR))   by Eq. (4)
  = sup_σ P^σ_s(⋃_k (EN(k) ∩ PAR))   def. of value
  = sup_σ sup_k P^σ_s(EN(k) ∩ PAR)   continuity of measures from below
  = sup_k sup_σ P^σ_s(EN(k) ∩ PAR)   commutativity
  = sup_k Val_s(EN(k) ∩ PAR)   def. of value
  = LVal_s   def. of LVal_s.

Lemma 16.
For finite MDPs, almost-sure winning maximizer strategies for Gain can be chosen FD.

Proof. By Lemma 30 we have Val_s(Gain) = LVal_s. Moreover, the objective Gain is shift-invariant and therefore optimal strategies exist [34]. Thus it follows from [41, Theorem 18] that AS(Gain) = AS(Reach(A ∪ B)) for the following sets of states: A def= ⋃_{k∈ℕ} AS(ST(k) ∩ PAR) and B def= AS(LimInf(=∞) ∩ PAR). This means that if an a.s. winning strategy for Gain exists, then there also exists one that operates in two phases: 1) a.s. reach A ∪ B; this can be done with memoryless deterministic strategies. 2a) Once in A, proceed along an a.s. winning strategy for ST(k) ∩ PAR, which can be done deterministically with memory O(k · |G|). Or, 2b) once in B, proceed along an a.s. winning strategy for LimInf(=∞) ∩ PAR. For MDPs a strategy is almost-sure winning for LimInf(=∞) ∩ PAR iff it is almost-sure winning for MP(>0) ∩ PAR, the combination of a parity condition with a strictly positive mean-payoff condition. Such strategies can be chosen FD [17].
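For intuition about the objective itself: on an ultimately periodic (lasso) run, Gain = LimInf(> −∞) ∩ PAR reduces to two finite checks. The cumulative rewards stay bounded from below iff the total reward of the repeated cycle is non-negative (a strictly negative cycle drains the energy towards −∞), and the minimal priority seen infinitely often is exactly the minimal priority on the cycle. A minimal sketch, with a run encoded (hypothetically, for this illustration only) as a finite prefix and a cycle, each a list of (priority, reward) pairs:

```python
# Gain = LimInf(> -inf) ∩ PAR evaluated on a lasso run (hypothetical helper).

def satisfies_gain(prefix, cycle):
    # the finite prefix shifts the energy only by a constant, so it affects
    # neither liminf-finiteness nor the parity condition
    cycle_reward = sum(r for (_, r) in cycle)
    # liminf of the cumulative rewards is > -inf iff the cycle does not
    # drain energy on average
    liminf_finite = cycle_reward >= 0
    # parity: the minimal priority occurring infinitely often must be even;
    # on a lasso, only the cycle priorities occur infinitely often
    parity_ok = min(p for (p, _) in cycle) % 2 == 0
    return liminf_finite and parity_ok
```

For example, a cycle alternating reward +1 at priority 2 with reward −1 at priority 4 satisfies Gain, while any cycle whose dominant priority is odd, or whose total reward is negative, does not.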
C.2 Gain is in coNP
Theorem 17.
Checking whether a state s ∈ V of an SSG satisfies Gain almost-surely is in coNP.

Proof. By Lemma 15, it suffices to show coNP membership for the MDP case only, as a witnessing MD strategy for minimizer can be guessed as part of the certificate. To check if maximizer can almost surely win from state s in an MDP with the Gain objective, we can equivalently check whether Val_s(Gain) = 1. This is because the objective is shift-invariant and therefore optimal strategies exist [34]. By Lemma 30, we can alternatively check whether LVal_s = 1, which can be done in coNP by [41, Lemma 26].

C.3 Gain is in NP

Before we can proceed with the technical details of the G → G₁ → G₂ → G₃ constructions, we first need to introduce the following standard definitions.

Definition 31.
Let M = G[τ] be an MDP induced by a game G def= (V = (V_□, V_◇, V_○), E, λ), where V_□, V_◇, V_○ are the maximizer, minimizer, and random states, and an MD strategy τ for minimizer. An end-component is a strongly connected set of states C ⊆ V such that, for every state v ∈ C, if v ∈ V_□ then some successor v′ of v is in C, and if v ∈ V_○ ∪ V_◇ then all successors v′ of v are in C.

A leaf-component is an end-component of a Markov chain G[σ, τ].

A leaf-component is storage-parity-safe if its dominating priority is even and it satisfies the storage condition ⋃_{k≥0} ST(k), and mean-positive if its mean-payoff is positive.

An end-component C of G[τ] is gain-safe if (1) the dominating priority of C (the smallest priority of any state in C) is even and C contains a mean-positive leaf-component, or (2) there is an MD strategy σ for maximizer such that C is a storage-parity-safe leaf-component in G[σ, τ].
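The closure conditions of Definition 31 are straightforward to check mechanically. The sketch below uses a hypothetical encoding (successor lists plus an owner map, with minimizer states already resolved by the fixed MD strategy τ, so that they have a single successor); it is an illustration, not the paper's formalism.

```python
# Check the end-component conditions of Definition 31 for a candidate set C.
# succ maps states to successor lists; owner maps states to 'max', 'min'
# (already resolved by tau) or 'rand'.

def is_end_component(C, succ, owner):
    C = set(C)
    for v in C:
        inside = [w for w in succ[v] if w in C]
        if owner[v] == 'max':
            if not inside:                   # some successor must stay in C
                return False
        else:                                # 'min' (under tau) or 'rand':
            if len(inside) != len(succ[v]):  # all successors must stay in C
                return False
    # strong connectivity inside C: every state reaches every other within C
    for s in C:
        seen, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in succ[v]:
                if w in C and w not in seen:
                    seen.add(w)
                    stack.append(w)
        if seen != C:
            return False
    return True
```

The quadratic reachability check suffices for a sketch; standard maximal-end-component decompositions would instead use SCC computations.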
Gain is gain-safe, whichjustifies its name. This is because either (1) holds, or else maximizer can reachagain and again a state with a dominating even priority without the need topump up the energy level first for which an MD strategy suffices, so (2) wouldhold then.
C.3.1 Blow-up Construction (G → G₁)

The G → G₁ construction just multiplies all rewards by a "large enough" factor. Formally, we need G₁ to have the following property.

Lemma 32.
Let τ be an MD strategy for minimizer and s ∈ V. If there exists a strategy σ for maximizer such that P^{G,σ,τ}_s(MP(>0) ∩ PAR) = 1, then there exists an FD strategy σ′ for maximizer such that P^{G₁,σ′,τ}_s(MP(>2) ∩ PAR) = 1.

We construct a game G₁, based on G, in which all edge rewards are multiplied by a large factor, so that if maximizer can originally ensure the parity condition and a positive expected mean-payoff in G, then he can ensure the parity condition and an expected mean-payoff higher than 2 in G₁. It is intuitively clear that such a factor exists, because multiplying all transition rewards by a positive factor has no effect on the outcome of the Gain objective. What is less clear is that such a factor can be chosen of polynomial size, so that G₁ is only polynomially larger than G. Before we can proceed with the proof of Lemma 32, we need to show an auxiliary result below.

Recalling that, for the Gain objective, minimizer has MD optimal strategies, we consider the effect of multiplying the rewards of all edges by the factor f against every such strategy τ: we show that if maximizer can a.s. obtain MP(>0) ∩ PAR from a state s in the MDP G[τ], then he can a.s. obtain MP(>2) ∩ PAR from s in the MDP G₁[τ].

Lemma 33.
Let (1) τ be an MD strategy for minimizer, (2) E be an end-component in the MDP G[τ] with even minimal priority, (3) σ an MD strategy for maximizer, and (4) L ⊆ E a leaf-component in G[σ, τ] with expected payoff p > 0. Then 2/p is at most exponential in the size of G.

Moreover, a factor f > 2/p, with a representation polynomial in |G|, can be computed independently of τ, E, σ, or L.

Proof. For any fixed MD strategies σ and τ, we can write a linear program for the so-called gain-bias relations in L, which is a standard way to solve MDPs with a mean-payoff objective (see, e.g., [44, Theorem 8.2.6(a), p. 343]). In any solution, the gain of a state equals its mean-payoff value while, broadly speaking, the bias compensates for the fluctuation of the payoff, of which the gain is only the expected long-term average. (The term 'gain' from the 'gain-bias relations' and our 'Gain objective' are unrelated established terms.)

Notice that for a fixed L, we only need a single gain variable g, because all nodes in a leaf-component have the same mean-payoff. For each node u ∈ L, we introduce a bias variable b_u. The constraints of the gain-bias linear program for L are:

b_u = b_{τ(u)} + r(u, τ(u)) − g   for all u ∈ V_◇ ∩ L
b_u = b_{σ(u)} + r(u, σ(u)) − g   for all u ∈ V_□ ∩ L
b_u = Σ_{(u,v)∈E} λ(u, v)(b_v + r(u, v)) − g   for all u ∈ V_○ ∩ L

and its objective is: Maximize g.

It follows from the proof of Corollary 10.2a in [47] that the size of an optimal finite solution to this linear program is at most 4m(m + 1)(S + 1), where m is the number of variables and S is the maximum size of any coefficient used. In our case we can easily estimate that m ≤ |V| + 1 and S ≤ |G|, so the optimal solution, p, is of size polynomial in |G|. And, since p > 0, the same holds for 2/p.

Note that the loose upper bound given above on the size of 2/p does not depend on τ, σ, E, or L, so if we take the maximum of the size of 2/p over all possible τ, σ, E, and L, we still get the same upper bound. Such an f will serve as our sufficiently large (yet sufficiently small) blow-up factor: G₁ is obtained from G by changing the reward function to r₁(e) = f · r(e) for all e ∈ E, i.e., by multiplying all rewards by f. We are now finally ready to prove Lemma 32.

Proof of Lemma 32.
The existence of an FD strategy σ′ that achieves P^{G,σ′,τ}_s(MP(>0) ∩ PAR) = 1 follows from [17]. Moreover, σ′ achieves the same mean-payoff, denoted by p′, as the original almost-sure winning strategy σ. By Lemma 33, the mean-payoff of σ′ in G₁ is ≥ f · p′ > 2.

[Fig. 3: An example game G (left) and its derived game G₁ (right), which happens to be equal to G.]

Example 34 (running example).
Consider the game G in Fig. 3 (left). Maximizer can almost-surely guarantee the Gain condition. The strategy that always loops in the right-most state ensures a mean-payoff of 3. As this is the only MD strategy for maximizer that ensures a positive mean-payoff, picking any factor f > 2/3 is sufficient. In particular we can pick f = 1, which results in G₁ = G.

C.3.2 Trade-in Construction (G₁ → G₂)

We are now going to modify the game G₁ into the game G₂, where maximizer can sacrifice part of the reward he would normally get while visiting a probabilistic node in exchange for rebalancing the values of these rewards.

During the construction of G₂ we fix an optimal MD strategy τ* for minimizer in G₁. The game G₂ will be the same no matter which optimal strategy is picked as τ*.

We start the construction of G₂ by identifying the union, U, of all gain-safe end-components of G₁[τ*] for which there is no maximizer strategy that ensures MP(>2) ∩ PAR. Condition (2) of gain-safeness has to hold for these instead, i.e., there are MD maximizer strategies that a.s. satisfy storage and parity, and note that then the mean-payoff has to be 0. We can compose all these strategies into a single winning maximizer MD strategy σ for all states in U. We now collapse all states in U into a single gain-safe state u_a with an even priority and a self-loop with payoff 3, resulting in the SSG G_U. Now, if maximizer can a.s. reach U in G₁[τ*], then he can enforce MP(>2) ∩ PAR in G_U[τ*]. All the remaining gain-safe end-components in G_U satisfy MP(>0) ∩ PAR and so MP(>2) ∩ PAR due to Lemma 32.

We therefore fix a winning maximizer MD strategy σ for MP(>2) and, for each MD strategy τ, write a linear program consisting of the gain-bias inequalities for a gain of at least 2 in G_U[σ, τ], forcing all biases to be non-negative and of polynomial size. This is a straightforward adaptation of the gain-bias relations for solving mean-payoff MDPs (see, e.g., [44, Theorem 8.2.6(a), p. 343]). In particular, we have

b_{τ,u} < b_{τ,τ(u)} + r(u, τ(u)) − 2   for u ∈ V_◇ \ U
b_{τ,u} < b_{τ,σ(u)} + r(u, σ(u)) − 2   for u ∈ V_□ \ U
b_{τ,u} < Σ_{(u,v)∈E} λ(u, v)(b_{τ,v} + r(u, v)) − 2   for u ∈ V_○ \ U
b_{τ,u} ≥ 0   for u ∈ {u_a} ∪ V \ U

and we pick as the objective: Minimize Σ_{u ∈ {u_a} ∪ V \ U} b_{τ,u}.

It follows from the proof of Corollary 10.2a in [47] that the size of an optimal finite solution to such a linear program is at most 4m(m + 1)(S + 1), where m is the number of variables and S is the maximum size of any coefficient used. In our case m ≤ |V| + 1 and S ≤ |G₁|, so the size of any b_{τ,u} in an optimal finite solution to such a linear program is polynomial in |G₁|. Note that this loose upper bound, B, does not depend on τ, σ, or U.

[Fig. 4: The reduction from G₁ to G₂, in which maximizer can choose to rebalance the rewards of edges out of probabilistic states at the cost of a reduced expected payoff, where a_i = ⌊b_{τ_i,s} − b_{τ_i,t₁} + 1⌋, b_i = ⌊b_{τ_i,s} − b_{τ_i,t₂} + 1⌋, and k is the number of MD minimizer strategies.]

We now build the SSG G₂ = (V₂, E₂, λ₂), where G₂ ⊇ G_U, and the associated reward function r₂, which we derive from G_U by allowing maximizer to redistribute the rewards of random edges. More precisely, let s be a random state with two outgoing edges (s, t₁), (s, t₂) ∈ E and a unique predecessor p ∈ V_□. Then, for every MD minimizer strategy τ, G₂ contains an extra random state s_τ and edges (p, s_τ), (s_τ, t₁), (s_τ, t₂), with the same probabilities p₁ and p₂ for taking (s_τ, t₁) and (s_τ, t₂) as for taking (s, t₁) and (s, t₂), respectively, and rewards r₂(p, s_τ) def= r₁(p, s), r₂(s_τ, t₁) def= ⌊b_{τ,s} − b_{τ,t₁} + 1⌋ and r₂(s_τ, t₂) def= ⌊b_{τ,s} − b_{τ,t₂} + 1⌋. See Fig. 4 for an example. Notice that, due to the inequalities defining the biases b_{τ,u}, we have p₁ r₂(s_τ, t₁) + p₂ r₂(s_τ, t₂) + 1 < p₁ r₁(s, t₁) + p₂ r₁(s, t₂), so maximizer sacrifices an expected reward of at least 1 at s_τ.

This extended arena has the following property for every state u ∈ V of G₂.

Lemma 35.
Let τ be an MD minimizer strategy and u a state. Then u ∈ AS_{G₁[τ]}(Gain) if, and only if, u ∈ ⋃_{k≥0} AS_{G₂[τ]}(ST(k) ∩ PAR).

Proof. (⇐) Pick k such that u ∈ AS_{G₂[τ]}(ST(k) ∩ PAR) holds, and let σ be an a.s. winning FD strategy for maximizer. Now in G₁, we simply follow σ, but whenever σ picks a trade-in edge to s_τ, we pick the original edge to s instead. Notice that such a strategy ensures parity, and the energy level at any point can only increase. If such a strategy reaches a node in U then it switches to an optimal strategy for ST(k′) ∩ PAR, where k′ is the minimum energy for which ST(k′) ∩ PAR holds for all states in U. It is easy to see that while using such a strategy the energy can never drop by more than k + k′, so it has to satisfy Gain a.s.

(⇒) First of all, note that due to the definition of the biases b_{τ,u} we have that r₂(u, u′) > b_{τ,u} − b_{τ,u′} + 2 for u ∈ V_□ ∪ V_◇, and r₂(u, u′) > b_{τ,u} − b_{τ,u′} for u ∈ {s_τ | s ∈ V_○}, because ⌊x + 1⌋ > x for all x.

Now pick any a.s. winning σ for Gain in G₁[τ]. Let σ′ be σ modified to always pick trade-ins s_τ when possible. Such a strategy still satisfies parity a.s. Consider any play ρ = s₀e₀s₁e₁s₂e₂... of σ′. If ρ reaches a state in U then we switch at that point to an optimal strategy for ST(k′) ∩ PAR as defined above. Otherwise, we have that for any infix s_l e_l ... e_{h−1} s_h of ρ, the change in the energy level is Σ_{i=l}^{h−1} r₂(s_i, s_{i+1}) > Σ_{i=l}^{h−1} (b_{τ,s_i} − b_{τ,s_{i+1}}) = b_{τ,s_l} − b_{τ,s_h} ≥ −B. This shows that ST(B + k′) ∩ PAR is satisfied a.s. by such a strategy.

Using the existence of MD optimal minimizer strategies for their respective objectives in both games, we get the following.
Corollary 36. u ∈ AS_G(Gain) ⟺ u ∈ ⋃_{k≥0} AS_{G₂}(ST(k) ∩ PAR).

Proof. First of all, by the way G₁ is defined, we have u ∈ AS_G(Gain) ⟺ u ∈ AS_{G₁}(Gain).

(⇒) For all MD strategies τ it has to hold that u ∈ AS_{G₁[τ]}(Gain). Due to Lemma 35 we get u ∈ ⋃_{k≥0} AS_{G₂[τ]}(ST(k) ∩ PAR), so there exists k such that u ∈ AS_{G₂[τ]}(ST(k) ∩ PAR). As there are only finitely many MD strategies, we let k* be the maximum value of k corresponding to one of them. Note that u ∈ AS_{G₂}(ST(k*) ∩ PAR) has to hold, because u ∈ AS_{G₂[τ]}(ST(k*) ∩ PAR) for all MD strategies τ (as the ST(k) ∩ PAR objective is upward-closed in k) and one of them has to be an optimal strategy for minimizer.

(⇐) Suppose that u ∉ AS_{G₁}(Gain); then pick an MD optimal minimizer strategy τ such that u ∉ AS_{G₁[τ]}(Gain). Due to Lemma 35 we get u ∉ ⋃_{k≥0} AS_{G₂[τ]}(ST(k) ∩ PAR); a contradiction with the fact that u ∈ ⋃_{k≥0} AS_{G₂}(ST(k) ∩ PAR).

[Fig. 5: An example game G₁ (left) and its derived game G₂ (right).]

Example 37 (continuation of Example 34).
Consider the game G₁ in Fig. 5 (left). In its derived game G₂ there are as many trade-in options for the random state as there are MD minimizer strategies (just two in this example). The blue one (top left) corresponds to minimizer going left and the red one (top right) to going up. Example biases that satisfy the inequalities presented in Section C.3.2 are drawn next to the nodes inside colored boxes. They result in the rewards 4 and −10 for the blue trade-in and 4 and − for the red one.

C.3.3 Concise Witnesses Construction (G₂ → G₃)

The final step is to show that we can clean up G₂ by removing all but a small number of the new trade-in options for maximizer when entering a random state, while preserving the fact that maximizer wins the ST(·) ∩ PAR objective. Formally, this whole subsection is dedicated to the proof of the following crucial lemma.
Lemma 38.
There exists a game G₃ ⊆ G₂ that results from G₂ by keeping, for any random state, at most twice as many trade-in options as G₂ has states, and such that for any state s ∈ V maximizer wins the almost-sure k-storage-parity game in G₂ iff he does in G₃.

Most of the properties in this subsection hold for an arbitrary energy-parity game, so we will use H instead of G₂ in order to avoid double subscripts.

The main idea of the proof of Lemma 38 is to use the monotonicity of the ST(k) ∩ PAR objective with respect to the initial energy level k. If maximizer a.s. wins ST(·) ∩ PAR from state p then there is a least k_p ∈ ℕ such that (for some l) ST(k_p, l) ∩ PAR holds a.s. Fix l large enough to work for all minimal k_p for every state p, and for all purposes of the proofs below.

Consider a configuration (p, k_p) ∈ AS(ST(k_p, l) ∩ PAR) where p has newly introduced outgoing edges that allow for trade-ins (it has a random successor node). Let σ be a winning maximizer strategy for this game that depends only on the state and the energy level in the energy store, and let σ_min denote the maximizer strategy that maps each maximizer state p to the successor that σ assigns to (p, k_p). Note that this strategy is positional, and therefore uses only one possible trade-in option.

We first observe that, by using this strategy, maximizer can only gain energy distance relative to the minimal energy level of the current state (except where the energy is limited by the capacity of his energy store): for every run (s₀, k₀), (s₁, k₁), (s₂, k₂), ... of H consistent with σ_min and all i ∈ ω it holds that k_{i+1} − k_{s_{i+1}} ≥ k_i − k_{s_i}. The following lemma is a direct consequence.

Lemma 39.
The strategy σ_min almost-surely guarantees that 1) the cumulative rewards tend to infinity or 2) the parity condition holds. That is, for every minimizer strategy τ and initial state s of H it holds that P^{H,σ_min,τ}_s(LimInf(=∞) ∪ PAR) = 1.

Recall that such a strategy must exist as, once the store limit l is fixed, the game becomes an ordinary finite parity game.

Proof. Assume for contradiction that minimizer has a strategy that ensures that runs with a positive probability weight contain (1) only finitely many transitions that lead to a true gain in energy (relative to the minimal energy level) and (2) do not satisfy the parity condition. (1) is a co-Büchi objective and (2) a parity objective, so (1) and (2) together form a parity objective. Thus, minimizer has a memoryless strategy τ to obtain this, and H[σ_min, τ] has a leaf-component where this holds. But then H[σ, τ] is not winning on the states of this leaf-component on the minimal energy level. (Contradiction.)

We call the property (LimInf(=∞) ∪ PAR) established by this lemma the lift-or-win property and will use it for a separation of concerns. For this, we first show that, when the dominating priority is odd, maximizer can win on a smaller set that he can ensure is never left, while winning the energy storage condition almost surely.

For a set S of states, we write atr^H_i(S) for the set of states from which player i ∈ {□, ◇} (maximizer/minimizer) can force the game to a state in S. In particular, atr^H_□(S) = AS(F S) is the set of states from which maximizer can ensure to almost-surely reach S. We call a set S of states a (minimizer) trap if all minimizer states and all random states in S have only successors in S. Naturally, the union of two traps is also a trap, so there exists a unique ⊆-maximal trap.

Lemma 40.
Let H be a game with minimal odd priority o in which maximizer wins storage-parity from all positions, and let S_o be the set of states with priority o. Then there is a trap S_t in H \ atr^H_◇(S_o) such that maximizer wins storage-parity from all positions in the subgame H ∩ S_t, that is, without exiting S_t.

Proof. Assume for contradiction that no such trap exists. Then minimizer has an almost-sure winning strategy, and thus a positional winning strategy τ, for all positions in H \ atr^H_◇(S_o). But then minimizer can win almost surely in H with a positional winning strategy that fixes an arbitrary strategy for her positions in S_o, uses her attractor strategy in all her other positions in atr^H_◇(S_o), and τ elsewhere. (Contradiction.)

The minimal energy level for winning from a state in S_t can, of course, differ from the minimal sufficient energy level for the same state in the full game H. We now partition the winning regions using divide and conquer.

Lemma 41.
Let H be a game where maximizer wins storage-parity from all positions. Let o be the minimal odd priority that occurs in H. If o is the minimal priority in H then let S_t be the trap guaranteed by Lemma 40; otherwise let S_t be the set of states with priority smaller than o. The following holds.

1. Maximizer wins storage-parity from all positions in the subgame H′ = H \ atr^H_□(S_t).
2. Fix maximizer strategies σ₁, σ₂ and σ₃ that are almost-sure winning for 1) storage-parity in H′, 2) storage-parity in S_t, and 3) reachability (F S_t), respectively, and let I ⊆ H be the game in which all new trade-in states that are never used by those strategies are removed. Maximizer almost-surely wins the storage-parity objective from all states of I.

Proof. For the first part, one immediately sees that a winning minimizer strategy for (some) states in H′ would also be winning for these states in H.

For the second part, notice that maximizer can combine the existing strategies into an overall winning strategy as follows. Suppose k_t ∈ ℕ is large enough so that, for all states in S_t, maximizer wins the storage-parity objective ST(k_t) ∩ PAR. Based on this, we can pick k_a ∈ ℕ large enough so that, in a game that starts in atr^H_□(S_t) with energy k_a and where maximizer plays the attractor strategy towards S_t, he has a positive chance of reaching S_t with energy ≥ k_t while remaining in the almost-sure winning region for storage-parity (in H). Finally, let k > max{k_a, k_t} be large enough so that, for all states in H′, maximizer wins the storage-parity objective ST(k) ∩ PAR. W.l.o.g. this is already witnessed by the strategy σ₁, by monotonicity of the objective. Maximizer will play as follows.

As long as the energy level is low (< k), maximizer plays according to σ_min. By the lift-or-win property (Lemma 39) he can either win or gain an arbitrary amount of energy. Alternatively, assuming he is in atr^H_□(S_t) and has sufficient energy, he invests it into an attempt to reach S_t within atr^H_□(S_t), while complying with the minimal energy level on the way and, if o is the minimal priority, having sufficient energy in S_t to win storage-parity in the trap S_t. Outside of atr^H_□(S_t), he plays according to σ₁, the winning strategy in H′, while maintaining an energy level of at least k. This combined strategy is winning for k-storage-parity because 1) it remains in the almost-sure winning region and 2) it either eventually forever follows a winning strategy in S_t ∪ H′, or (in case the minimal priority is even) infinitely often tries to reach states with the dominant priority.

As this strategy only combines the existing strategies, it never uses any trade-in state in H \ I, and therefore works in the smaller subgame I.

This finally allows us to establish our main claim, of which Lemma 38 is a direct consequence.

Lemma 42.
If the maximizer almost surely wins storage parity for G, he can win storage parity in G with a strategy that does not use, for any state in G, more choices than twice the number of states of G.

Proof. The claim follows from a recursive application of Lemma 41. Starting with
H ⊆ G defined by the almost-sure winning states, each application splits the game into disjoint subgames H′, S_t, and atr^□_H(S_t) \ S_t, in which maximizer can be assumed to win according to simpler (w.r.t. the number of trade-ins used) strategies. Notice that every new trade-in state s_τ belongs to the same subgame as its accompanying original random state s ∈ G. In every decomposition S_t must be non-empty, so the number of states in G bounds the recursion depth.

The base cases are either empty or games in which maximizer wins only by combining σ_min and an attractor strategy towards the dominating priority. Both can be chosen MD. In any further decomposition, any given state will either belong to a smaller game (H′ or S_t), in which case the number of necessary trade-in options is unchanged, or lie in atr^□_H(S_t) \ S_t, in which case the combined strategy may need to choose between σ_min and an attractor strategy. But notice that the choice of trade-in state is meaningless for the attractor strategy, because all such states have the same (distributions over) successors.
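The recursive decomposition used in this proof can be illustrated with a small sketch. Everything here is a toy stand-in, not the paper's construction: a "game" is a bare dict of states to (priority, successors), `attractor` is plain backward reachability standing in for the minimizer attractor atr^□_H, and the trap of Lemma 40 is replaced by the set of minimal-priority states. The sketch only shows the shape of the recursion and why its depth is bounded by the number of states.

```python
def attractor(game, target):
    # Toy attractor: iterated backward reachability towards `target`.
    # (The real construction distinguishes player-owned and random states.)
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for s, (_, succs) in game.items():
            if s not in attr and succs & attr:
                attr.add(s)
                changed = True
    return attr


def decompose(game, depth=0):
    """Split the game as in Lemma 41 and recurse on H' and S_t.

    Returns the maximal recursion depth reached. Since S_t is non-empty
    and each recursive call strictly shrinks the game, the depth is
    bounded by the number of states.
    """
    if not game:                          # base case: empty game
        return depth
    priorities = {p for p, _ in game.values()}
    odd = [p for p in priorities if p % 2 == 1]
    if not odd or min(odd) == min(priorities):
        # o is minimal: S_t would be the trap of Lemma 40; as a
        # stand-in, the sketch takes the states of minimal priority.
        s_t = {s for s, (p, _) in game.items() if p == min(priorities)}
    else:
        s_t = {s for s, (p, _) in game.items() if p < min(odd)}
    if s_t == set(game):
        # base case: one priority class dominates; maximizer wins by
        # combining sigma_min with an attractor strategy (both MD).
        return depth + 1
    attr = attractor(game, s_t)
    h_prime = {s: (p, su - attr) for s, (p, su) in game.items() if s not in attr}
    trap = {s: (p, su & s_t) for s, (p, su) in game.items() if s in s_t}
    return max(decompose(h_prime, depth + 1), decompose(trap, depth + 1))
```

On a three-state example the recursion splits once and bottoms out immediately, matching the bound of at most one level per state.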
Fig. 6: An example game G (left) and the derived games.

Example 43 (continuation of Example 37).
Consider the game G in Fig. 6 (left). We can prune G into a game where all but one new alternative state is removed. In this pruned game, depicted on the right, maximizer can almost-surely guarantee the Gain condition while simultaneously ensuring that no negative cycle is closed. This means that ST(k) ∩ PAR holds almost-surely in the pruned game, and hence EN(k) ∩ PAR holds in G.
We are now ready to prove the main theorem of Section 5.
Theorem 18.
Checking whether a state s ∈ V of G satisfies Gain almost-surely is in NP.

Proof. Guess a game that uses only the given bound on the number of choices, i.e., without constructing the exponentially large game G. Prune the unreachable random states and verify that maximizer can almost-surely ensure the storage-parity objective in the guessed game.
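The guess-and-check structure of this argument can be sketched in code. All names here are hypothetical illustrations, not the paper's definitions: `alternatives[s]` lists the trade-in options of a random state s, `verifier` stands in for the (assumed) polynomial-time check of almost-sure storage parity on the small pruned game, and the backtracking search merely simulates the NP machine's single nondeterministic guess of a certificate keeping at most `bound` options per state.

```python
from itertools import combinations


def certify(alternatives, bound, verifier):
    """Backtracking stand-in for the nondeterministic guess.

    A certificate keeps, for every random state, a non-empty tuple of at
    most `bound` trade-in options; `verifier` checks the resulting
    pruned game.  Returns a certificate dict, or None if none works.
    """
    states = sorted(alternatives)

    def search(i, chosen):
        if i == len(states):
            # All choices fixed: run the assumed polynomial verifier.
            return dict(chosen) if verifier(chosen) else None
        s = states[i]
        for size in range(1, min(bound, len(alternatives[s])) + 1):
            for pick in combinations(alternatives[s], size):
                chosen[s] = pick
                result = search(i + 1, chosen)
                if result is not None:
                    return result
        chosen.pop(s, None)
        return None

    return search(0, {})
```

The point of the sketch is only the certificate format: each candidate is polynomial in the size of G even though the number of candidates (and the fully expanded game) is exponential.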