Markov Decision Processes with Multiple Long-run Average Objectives
Tomáš Brázdil, Václav Brožek, Krishnendu Chatterjee, Vojtěch Forejt, Antonín Kučera
TOMÁŠ BRÁZDIL^a, VÁCLAV BROŽEK^b, KRISHNENDU CHATTERJEE^c, VOJTĚCH FOREJT^d, AND ANTONÍN KUČERA^e

a,b,e Faculty of Informatics, Masaryk University, Brno, Czech Republic
e-mail address: {brazdil,xbrozek,kucera}@fi.muni.cz

c IST Austria, Klosterneuburg, Austria
e-mail address: [email protected]

d Department of Computer Science, University of Oxford, UK
e-mail address: [email protected]
Abstract.
We study Markov decision processes (MDPs) with multiple limit-average (or mean-payoff) functions. We consider two different objectives, namely, expectation and satisfaction objectives. Given an MDP with k limit-average functions, in the expectation objective the goal is to maximize the expected limit-average value, and in the satisfaction objective the goal is to maximize the probability of runs such that the limit-average value stays above a given vector. We show that under the expectation objective, in contrast to the case of one limit-average function, both randomization and memory are necessary for strategies even for ε-approximation, and that finite-memory randomized strategies are sufficient for achieving Pareto optimal values. Under the satisfaction objective, in contrast to the case of one limit-average function, infinite memory is necessary for strategies achieving a specific value (i.e., randomized finite-memory strategies are not sufficient), whereas memoryless randomized strategies are sufficient for ε-approximation, for all ε > 0. We further prove that the decision problems for both expectation and satisfaction objectives can be solved in polynomial time and that the trade-off curve (Pareto curve) can be ε-approximated in time polynomial in the size of the MDP and 1/ε, and exponential in the number of limit-average functions, for all ε > 0. Our analysis also reveals flaws in previous work for MDPs with multiple mean-payoff functions under the expectation objective, corrects the flaws, and allows us to obtain improved results.

[Mathematics of computing]: Probability and statistics—Stochastic processes—Markov processes; Design and analysis of algorithms—Mathematical optimization—Continuous optimization—Stochastic control and optimization / Convex optimization; [Software and its engineering]: Software creation and management—Software verification and validation—Formal software verification.
Key words and phrases:
Markov decision processes, mean-payoff reward, multi-objective optimisation, formal verification.
Logical Methods in Computer Science, DOI: 10.2168/LMCS-10(1:13)2014.
© T. Brázdil, V. Brožek, K. Chatterjee, V. Forejt, and A. Kučera; licensed under Creative Commons.
1. Introduction
Markov decision processes (MDPs) are the standard models for probabilistic dynamic systems that exhibit both probabilistic and nondeterministic behaviors [18, 11]. In each state of an MDP, a controller chooses one of several actions (the nondeterministic choices), and the system stochastically evolves to a new state based on the current state and the chosen action. A reward (or cost) is associated with each transition, and the central question is to find a strategy of choosing the actions that optimizes the rewards obtained over the run of the system. One classical way to combine the rewards over the run of the system is the limit-average (or mean-payoff) function that assigns to every run the average of the rewards over the run. MDPs with single mean-payoff functions have been widely studied in the literature (see, e.g., [18, 11]). In many modeling domains, however, there is not a single goal to be optimized, but multiple, potentially dependent and conflicting goals. For example, in designing a computer system, the goal is to maximize average performance while minimizing average power consumption. Similarly, in an inventory management system, the goal is to optimize several potentially dependent costs for maintaining each kind of product. These motivate the study of MDPs with multiple mean-payoff functions.

Traditionally, MDPs with mean-payoff functions have been studied with only the expectation objective, where the goal is to maximize (or minimize) the expectation of the mean-payoff function. There are numerous applications of MDPs with expectation objectives in inventory control, planning, and performance evaluation [18, 11]. In this work we consider both the expectation objective and also the satisfaction objective for a given MDP. In both cases we are given an MDP with k reward functions, and the goal is to maximize (or minimize) either the k-tuple of expectations, or the probability of runs such that the mean-payoff value stays above a given vector.

To get some intuition about the difference between the expectation/satisfaction objectives and to show that in some scenarios the satisfaction objective is preferable, consider a file-hosting system where the users can download files at various speeds, depending on the current setup and the number of connected customers. For simplicity, let us assume that a user has a 20% chance to get a 2000kB/sec connection, and an 80% chance to get a slow 20kB/sec connection. Then, the overall performance of the server can be reasonably measured by the expected amount of transferred data per user and second (i.e., the expected mean payoff), which is 0.2·2000 + 0.8·20 = 416kB/sec. However, a single user is more interested in her chance of downloading the files quickly, which can be measured by the probability of establishing and maintaining a reasonably fast connection (i.e., a connection whose speed stays above a suitable threshold); this corresponds to the satisfaction objective.

The set of achievable solutions
(i) under the expectation objective is the set of all vectors ~v such that there is a strategy to ensure that the expected mean-payoff value vector under the strategy is at least ~v;
(ii) under the satisfaction objective is the set of tuples (ν, ~v) where ν ∈ [0, 1] and ~v is a vector such that there is a strategy under which with probability at least ν the mean-payoff value vector of a run is at least ~v.
The "trade-offs" among the goals represented by the individual mean-payoff functions are formally captured by the Pareto curve, which consists of all minimal tuples (wrt. componentwise ordering) that are not strictly dominated by any achievable solution. Intuitively, the Pareto curve consists of "limits" of achievable solutions, and in principle it may contain tuples that are not achievable solutions (see Section 3). Pareto optimality has been studied in cooperative game theory [16] and in multi-criterion optimization and decision making in both economics and engineering [14, 21, 20].

Our study of MDPs with multiple mean-payoff functions is motivated by the following fundamental questions, which concern both basic properties and algorithmic aspects of the expectation/satisfaction objectives:
Q.1 What type of strategies is sufficient (and necessary) for achievable solutions?
Q.2 Are the elements of the Pareto curve achievable solutions?
Q.3 Is it decidable whether a given vector represents an achievable solution?
Q.4 Given an achievable solution, is it possible to compute a strategy which achieves this solution?
Q.5 Is it decidable whether a given vector belongs to the Pareto curve?
Q.6 Is it possible to compute a finite representation/approximation of the Pareto curve?
We provide comprehensive answers to the above questions, both for the expectation and the satisfaction objective. We also analyze the complexity of the problems given in Q.3–Q.6. From a practical point of view, it is particularly encouraging that most of the considered problems turn out to be solvable efficiently, i.e., in polynomial time. More concretely, our answers to Q.1–Q.6 are the following:
1.a For the expectation objectives, finite-memory randomized strategies are sufficient and necessary for all achievable solutions. Memory and randomization may also be needed to approximate an achievable solution up to ε for a given ε > 0.
1.b For the satisfaction objectives, achievable solutions may require infinite-memory strategies, but memoryless randomized strategies are sufficient for ε-approximation of achievable solutions, for all ε > 0.
2. Every element of the Pareto curve is an achievable solution, both under the expectation and the satisfaction objective.
3. The problem whether a given vector represents an achievable solution is solvable in polynomial time.
4. A strategy which achieves (or ε-approximates) a given solution is computable in polynomial time.
5. The problem whether a given vector belongs to the Pareto curve is solvable in polynomial time.
6. A finite description of the Pareto curve is computable in exponential time. Further, an ε-approximate Pareto curve is computable in time which is polynomial in 1/ε, the size of a given MDP and the maximal absolute value of a reward assigned, and exponential in the number of mean-payoff functions.
A more detailed and precise explanation of our results is postponed to Section 3.
Let us note that MDPs with multiple mean-payoff functions under the expectation objective were also studied in [7], where it was claimed that memoryless randomized strategies are sufficient for ε-approximation of the Pareto curve, for all ε > 0, and an NP algorithm was presented to find a memoryless randomized strategy achieving a given vector. We show with an example that under the expectation objective there exists ε > 0 such that strategies achieving an ε-approximation of a given achievable vector do require memory, and thus reveal a flaw in the earlier paper.

Similarly to the related papers [8, 10, 12] (see Related Work), we obtain our results by a characterization of the set of achievable solutions by a set of linear constraints, and from the linear constraints we construct witness strategies for any achievable solution. However, our approach differs significantly from the previous work. In all the previous works, the linear constraints are used to encode a memoryless strategy either directly for the MDP [8], or (if memoryless strategies do not suffice in general) for a finite "product" of the MDP and the specification function expressed as automata, from which the memoryless strategy is then transferred to a finite-memory strategy for the original MDP [10, 12, 9]. In our setting new problems arise. Under the expectation objective with mean-payoff functions, there is neither an immediate notion of a "product" of an MDP and a mean-payoff function, nor do memoryless strategies suffice. Moreover, even for memoryless strategies the linear constraint characterization is not straightforward for mean-payoff functions, as it is in the case of discounted [8], reachability [10], and total reward functions [12]: for example, in [7] even for memoryless strategies there was no linear constraint characterization for mean-payoff functions and only an NP algorithm was given. Our result, obtained by a characterization via linear constraints directly on the original MDP, requires an involved and intricate construction of witness strategies. Moreover, our results are significant and non-trivial generalizations of the classical results for MDPs with a single mean-payoff function, where memoryless pure optimal strategies exist, while for multiple functions both randomization and memory are necessary. Under the satisfaction objective, any finite product on which a memoryless strategy would exist is not feasible, as in general witness strategies for achievable solutions may need an infinite amount of memory. We establish a correspondence between the sets of achievable solutions under both types of objectives for strongly connected MDPs. Finally, we use this correspondence to obtain our results for satisfaction objectives.

A conference version of this work was published at LICS 2011 [3].

Related Work.
The study of Markov decision processes with multiple expectation objectives has been initiated in the area of applied probability theory, where it is known as constrained MDPs [18, 1]. The attention in the study of constrained MDPs has been focused mainly on restricted classes of MDPs, such as unichain MDPs, where all states are visited infinitely often under any strategy. Such a restriction guarantees both the existence of memoryless optimal strategies and a simpler linear-programming-based algorithm than in the general case studied in this paper.

For general finite-state MDPs, [8] studied MDPs with multiple discounted reward functions. It was shown that memoryless strategies suffice for Pareto optimization, and a polynomial-time algorithm was given to approximate (up to a given relative error) the Pareto curve by reduction to multi-objective linear programming and using the results of [17]. MDPs with multiple qualitative ω-regular specifications were studied in [10]. It was shown that the Pareto curve can be approximated in polynomial time; the algorithm reduces the problem to MDPs with multiple reachability specifications, which can be solved by multi-objective linear programming. In [12], the results of [10] were extended to combine ω-regular and expected total reward objectives. MDPs with multiple mean-payoff functions under expectation objectives were considered in [7]; our analysis reveals flaws in the earlier paper, corrects the flaws, and allows us to present significantly improved results (a polynomial-time algorithm for finding a strategy achieving a given vector, as compared to the previously suggested incorrect NP algorithm). Moreover, the satisfaction objective has not been considered in the multi-objective setting before, and even in the single-objective case it has been considered only in a very specific setting [4].

2. Preliminaries
We use N, Z, Q, and R to denote the sets of positive integers, integers, rational numbers, and real numbers, respectively. Given two vectors ~v, ~u ∈ R^k, where k ∈ N, we write ~v ≤ ~u iff ~v_i ≤ ~u_i for all 1 ≤ i ≤ k, and ~v < ~u iff ~v ≤ ~u and ~v_i < ~u_i for some 1 ≤ i ≤ k.

We assume familiarity with basic notions of probability theory, e.g., probability space, random variable, or expected value. As usual, a probability distribution over a finite or countably infinite set X is a function f : X → [0, 1] such that ∑_{x∈X} f(x) = 1. We call f positive if f(x) > 0 for every x ∈ X, rational if f(x) ∈ Q for every x ∈ X, and Dirac if f(x) = 1 for some x ∈ X. The set of all distributions over X is denoted by dist(X).

Markov chains. A Markov chain is a tuple M = (L, →, μ) where L is a finite or countably infinite set of locations, → ⊆ L × (0, 1] × L is a transition relation such that for each fixed ℓ ∈ L we have ∑_{ℓ →^x ℓ′} x = 1, and μ is the initial probability distribution on L.

A run in M is an infinite sequence ω = ℓ_0 ℓ_1 ... of locations such that ℓ_i →^{x_i} ℓ_{i+1} for every i ≥ 0. A finite path in M is a finite prefix of a run. Each finite path w in M determines the set Cone(w) consisting of all runs that start with w. To M we associate the probability space (Runs_M, F, P), where Runs_M is the set of all runs in M, F is the σ-field generated by all Cone(w), and P is the unique probability measure such that P(Cone(ℓ_0, ..., ℓ_k)) = μ(ℓ_0) · ∏_{i=0}^{k−1} x_i, where ℓ_i →^{x_i} ℓ_{i+1} for all 0 ≤ i < k (the empty product is equal to 1).
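As a small illustration (not part of the formal development), the cone probabilities above can be computed directly from a dictionary-based encoding of a Markov chain; the encoding and the name `cone_probability` are our own choices.

```python
def cone_probability(mu, trans, path):
    """P(Cone(l_0, ..., l_k)) = mu(l_0) * product of the transition probabilities along the path.

    mu    -- dict: location -> initial probability
    trans -- dict: (location, location) -> probability x of the transition l --x--> l'
    path  -- finite sequence of locations l_0, ..., l_k
    """
    p = mu.get(path[0], 0.0)
    for l, l_next in zip(path, path[1:]):
        p *= trans.get((l, l_next), 0.0)  # a single-location path keeps the empty product = 1
    return p

# toy two-location chain: l0 --0.4--> l0, l0 --0.6--> l1, l1 --1.0--> l1
mu = {"l0": 1.0}
trans = {("l0", "l0"): 0.4, ("l0", "l1"): 0.6, ("l1", "l1"): 1.0}
print(cone_probability(mu, trans, ["l0", "l0", "l1"]))  # 0.4 * 0.6 = 0.24
```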
Markov decision processes. A Markov decision process (MDP) is a tuple G = (S, A, Act, δ) where S is a finite set of states, A is a finite set of actions, Act : S → 2^A \ {∅} is an action enabledness function that assigns to each state s the set Act(s) of actions enabled at s, and δ : S × A → dist(S) is a probabilistic transition function that, given a state s and an action a ∈ Act(s) enabled at s, gives a probability distribution over the successor states. For simplicity, we assume that every action is enabled in exactly one state, and we denote this state Src(a). Thus, henceforth we will assume that δ : A → dist(S).

A run in G is an infinite alternating sequence of states and actions ω = s_1 a_1 s_2 a_2 ... such that for all i ≥ 1 we have Src(a_i) = s_i and δ(a_i)(s_{i+1}) > 0. We denote by Runs_G the set of all runs in G. A finite path of length k in G is a finite prefix w = s_1 a_1 ... a_{k−1} s_k of a run in G. For a finite path w we denote by last(w) the last state of w.

A pair (T, B) with ∅ ≠ T ⊆ S and B ⊆ ⋃_{t∈T} Act(t) is an end component of G if (1) for all a ∈ B, whenever δ(a)(s′) > 0 then s′ ∈ T; and (2) for all s, t ∈ T there is a finite path ω = s_1 a_1 ... a_{k−1} s_k such that s_1 = s, s_k = t, and all states and actions that appear in ω belong to T and B, respectively. An end component (T, B) is a maximal end component (MEC) if it is maximal wrt. pointwise subset ordering. Given an end component C = (T, B), we sometimes abuse notation by using C instead of T or B, e.g., by writing a ∈ C instead of a ∈ B for a ∈ A.
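To make the definition of end components concrete, the following sketch checks conditions (1) and (2) for a candidate pair (T, B). It is an illustration only; the dictionary encoding of Act and δ and the name `is_end_component` are our own assumptions, not notation from the paper.

```python
def is_end_component(T, B, Act, delta):
    """Check whether (T, B) is an end component.

    T     -- set of states; B -- set of actions, each enabled in some state of T
    Act   -- dict: state -> set of enabled actions
    delta -- dict: action -> dict: state -> probability (delta(a)(s'))
    """
    if not T or not all(any(a in Act[t] for t in T) for a in B):
        return False
    # (1) every action in B stays inside T with probability 1
    for a in B:
        if any(p > 0 and s not in T for s, p in delta[a].items()):
            return False
    # (2) every state of T reaches every other state of T using only states of T and actions of B
    for s in T:
        reached, stack = {s}, [s]
        while stack:
            t = stack.pop()
            for a in Act[t] & B:
                for u, p in delta[a].items():
                    if p > 0 and u not in reached:
                        reached.add(u)
                        stack.append(u)
        if not T <= reached:
            return False
    return True
```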
Strategies and plays.
Intuitively, a strategy in an MDP G is a "recipe" to choose actions. Usually, a strategy is formally defined as a function σ : (SA)*S → dist(A) that, given a finite path w representing the history of a play, gives a probability distribution over the actions enabled in last(w). In this paper, we adopt a somewhat different (though equivalent – see Section 6) definition, which allows a more natural classification of various strategy types.

Let M be a finite or countably infinite set of memory elements. A strategy is a triple σ = (σ_u, σ_n, α), where σ_u : A × S × M → dist(M) and σ_n : S × M → dist(A) are memory update and next move functions, respectively, and α is an initial distribution on memory elements. We require that for all (s, m) ∈ S × M, the distribution σ_n(s, m) assigns a positive value only to actions enabled at s. The set of all strategies is denoted by Σ (the underlying MDP G will always be clear from the context).

Let s ∈ S be an initial state. A play of G determined by s and a strategy σ is a Markov chain G^σ_s (or just G^σ if s is clear from the context) where the set of locations is S × M × A, the initial distribution μ is positive only on (some) elements of {s} × M × A, where μ(s, m, a) = α(m) · σ_n(s, m)(a), and

(t, m, a) →^x (t′, m′, a′)   iff   x = δ(a)(t′) · σ_u(a, t′, m)(m′) · σ_n(t′, m′)(a′) > 0.

Hence, G^σ_s starts in a location chosen randomly according to α and σ_n. In a current location (t, m, a), the next action to be performed is a, hence the probability of entering t′ is δ(a)(t′). The probability of updating the memory to m′ is σ_u(a, t′, m)(m′), and the probability of selecting a′ as the next action is σ_n(t′, m′)(a′). We assume that these choices are independent, and thus obtain the product above.

In this paper, we consider various functions over Runs_G that become random variables over Runs_{G^σ_s} after fixing some σ and s. For example, for F ⊆ S we denote by Reach(F) ⊆ Runs_G the set of all runs reaching F. Then Reach(F) naturally determines Reach^σ_s(F) ⊆ Runs_{G^σ_s} by simply "ignoring" the visited memory elements. To simplify and unify our notation, we write, e.g., P^σ_s[Reach(F)] instead of P^σ_s[Reach^σ_s(F)], where P^σ_s is the probability measure of the probability space associated to G^σ_s. We also adopt this notation for other events and functions, such as lr_inf(~r) or lr_sup(~r) defined in the next section, and write, e.g., E^σ_s[lr_inf(~r)] instead of E[lr_inf(~r)^σ_s].
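The product construction above is straightforward to simulate. The sketch below is our own illustration (the dictionary encoding of σ_u, σ_n, α, δ is an assumption); it samples a finite prefix of a run of the play G^σ_s, choosing the successor state, the new memory element, and the next action independently, exactly as in the definition of the transition probabilities.

```python
import random

def sample(dist):
    """Sample from a distribution given as dict: element -> probability (assumed to sum to 1)."""
    r, acc = random.random(), 0.0
    for x, p in dist.items():
        acc += p
        if r <= acc:
            return x
    return x  # guard against floating-point rounding

def simulate_play(s0, alpha, sigma_n, sigma_u, delta, steps):
    """Sample a prefix of a run of G^sigma_{s0} as a list of locations (state, memory, action)."""
    m = sample(alpha)                      # initial memory element, drawn from alpha
    a = sample(sigma_n[(s0, m)])           # initial action, drawn by the next-move function
    run, s = [(s0, m, a)], s0
    for _ in range(steps):
        s = sample(delta[a])               # successor state, probability delta(a)(t')
        m = sample(sigma_u[(a, s, m)])     # memory update, probability sigma_u(a, t', m)(m')
        a = sample(sigma_n[(s, m)])        # next action, probability sigma_n(t', m')(a')
        run.append((s, m, a))
    return run
```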
Strategy types. In general, a strategy may use infinite memory, and both σ_u and σ_n may randomize. According to the use of randomization, a strategy σ can be classified as
• pure (or deterministic), if α is Dirac and both the memory update and the next move function give a Dirac distribution for every argument;
• deterministic-update, if α is Dirac and the memory update function gives a Dirac distribution for every argument;
• stochastic-update, if α, σ_u, and σ_n are unrestricted.
Note that every pure strategy is deterministic-update, and every deterministic-update strategy is stochastic-update. A randomized strategy is a strategy which is not necessarily pure. We also classify the strategies according to the size of memory they use. Important subclasses are memoryless strategies, in which M is a singleton, n-memory strategies, in which M has exactly n elements, and finite-memory strategies, in which M is finite. By Σ_M we denote the set of all memoryless strategies. Memoryless strategies can be specified as σ : S → dist(A). Memoryless pure strategies, i.e., those which are both pure and memoryless, can be specified as σ : S → A.

Figure 1. Running example MDP (left) and its play (right)

For a finite-memory strategy σ, a bottom strongly connected component (BSCC) of G^σ_s is a subset of locations W ⊆ S × M × A such that (i) for all ℓ_1 ∈ W and ℓ_2 ∈ S × M × A, if ℓ_2 is reachable from ℓ_1, then ℓ_2 ∈ W, and (ii) for all ℓ_1, ℓ_2 ∈ W, ℓ_1 is reachable from ℓ_2. Every BSCC W determines a unique end component ({s | (s, m, a) ∈ W}, {a | (s, m, a) ∈ W}) of G, and we sometimes do not strictly distinguish between W and its associated end component.

As we already noted, stochastic-update strategies can be easily translated into "ordinary" strategies of the form σ : (SA)*S → dist(A), and vice versa (see Section 6). Note that a finite-memory stochastic-update strategy σ can be easily implemented by a stochastic finite-state automaton that scans the history of a play "on the fly" (in fact, G^σ_s simulates this automaton). Hence, finite-memory stochastic-update strategies can be seen as natural extensions of ordinary (i.e., deterministic-update) finite-memory strategies that are implemented by deterministic finite-state automata.
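Since BSCCs of the finite Markov chain G^σ_s play a central role below, the following is a direct transcription of the two defining conditions into code (an illustration only; `succ` maps a location to the set of one-step successors with positive probability, and the function names are ours).

```python
def reachable(start, succ):
    """Set of locations reachable from `start` (including start) in the chain given by succ."""
    seen, stack = {start}, [start]
    while stack:
        l = stack.pop()
        for l2 in succ.get(l, ()):
            if l2 not in seen:
                seen.add(l2)
                stack.append(l2)
    return seen

def is_bscc(W, succ):
    """Check conditions (i) and (ii) of the BSCC definition for a set W of locations."""
    for l in W:
        r = reachable(l, succ)
        if not r <= W:   # (i) nothing outside W is reachable from W
            return False
        if not W <= r:   # (ii) every location of W is reachable from every location of W
            return False
    return True
```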
A running example (I). As an example, consider the MDP G = (S, A, Act, δ) of Figure 1 (left). Here, S and A are the sets of states and actions depicted in the figure, Act is denoted using the labels on lines going from actions (each state is assigned the actions whose lines leave it), and δ is given by the arrows together with the attached probabilities; e.g., one of the actions reaches a designated successor with probability 0.3. Note that G has four end components (two singleton ones, and two sharing the same pair of states) and two MECs.

Let s_0 be the initial state and M = {m_1, m_2}. Consider a stochastic-update finite-memory strategy σ = (σ_u, σ_n, α) where α chooses m_1 deterministically, σ_n(s_0, m_1) randomizes between the two actions enabled in s_0, σ_n(s_0, m_2) chooses a single action with probability 1, and otherwise σ_n chooses self-loops. The memory update function σ_u leaves the memory intact except for the updates performed in s_0 under memory m_1, where both m_1 and m_2 are chosen with probability 0.5. The play G^σ_{s_0} is depicted in Figure 1 (right).

3. Main Results
In this paper we establish basic results about Markov decision processes with expectation and satisfaction objectives specified by multiple limit-average (or mean-payoff) functions. We adopt the variant where rewards are assigned to edges (i.e., actions) rather than states of a given MDP.

Let G = (S, A, Act, δ) be an MDP, and r : A → Q a reward function. Note that r may also take negative values. For every j ∈ N, let A_j : Runs_G → A be the function which to every run ω ∈ Runs_G assigns the j-th action of ω. Since the limit-average function lr(r) : Runs_G → R given by

lr(r)(ω) = lim_{T→∞} (1/T) ∑_{t=1}^{T} r(A_t(ω))

may be undefined for some runs, we consider its lower and upper approximations lr_inf(r) and lr_sup(r) that are defined for all ω ∈ Runs_G as follows:

lr_inf(r)(ω) = liminf_{T→∞} (1/T) ∑_{t=1}^{T} r(A_t(ω)),   lr_sup(r)(ω) = limsup_{T→∞} (1/T) ∑_{t=1}^{T} r(A_t(ω)).

For a vector ~r = (r_1, ..., r_k) of reward functions, we similarly define the R^k-valued functions

lr(~r) = (lr(r_1), ..., lr(r_k)),   lr_inf(~r) = (lr_inf(r_1), ..., lr_inf(r_k)),   lr_sup(~r) = (lr_sup(r_1), ..., lr_sup(r_k)).

We sometimes refer to "runs satisfying lr(~r) ≥ ~v" instead of "runs ω satisfying lr(~r)(ω) ≥ ~v". Now we introduce the expectation and satisfaction objectives determined by ~r.
• The expectation objective amounts to maximizing or minimizing the expected value of lr(~r). Since lr(~r) may be undefined for some runs, we actually aim at maximizing the expected value of lr_inf(~r) or minimizing the expected value of lr_sup(~r) (wrt. componentwise ordering ≤).
• The satisfaction objective means maximizing the probability of all runs where lr(~r) stays above or below a given vector ~v. Technically, we aim at maximizing the probability of all runs where lr_inf(~r) ≥ ~v, or at maximizing the probability of all runs where lr_sup(~r) ≤ ~v.
The expectation objective is relevant in situations when we are interested in the average or aggregate behaviour of many instances of a system; in contrast, the satisfaction objective is relevant when we are interested in particular executions of a system and wish to optimize the probability of generating the desired executions. Since lr_inf(~r) = −lr_sup(−~r), the problems of maximizing and minimizing the expected value of lr_inf(~r) and lr_sup(~r) are dual. Therefore, we consider just the problem of maximizing the expected value of lr_inf(~r). For the same reason, we consider only the problem of maximizing the probability of all runs where lr_inf(~r) ≥ ~v.

If k (the dimension of ~r) is at least two, there might be several incomparable solutions to the expectation objective; and if ~v is slightly changed, the achievable probability of all runs satisfying lr_inf(~r) ≥ ~v may change considerably. Therefore, we aim not only at constructing a particular solution, but at characterizing and approximating the whole space of achievable solutions for the expectation/satisfaction objective. Let s ∈ S be some (initial) state of G. We define the sets AcEx(lr_inf(~r)) and AcSt(lr_inf(~r)) of achievable vectors for the expectation and satisfaction objectives as follows:

AcEx(lr_inf(~r)) = {~v | ∃σ ∈ Σ : E^σ_s[lr_inf(~r)] ≥ ~v},
AcSt(lr_inf(~r)) = {(ν, ~v) | ∃σ ∈ Σ : P^σ_s[lr_inf(~r) ≥ ~v] ≥ ν}.

Intuitively, if ~v, ~u are achievable vectors such that ~v > ~u, then ~v represents a "strictly better" solution than ~u. The set of "optimal" solutions defines the Pareto curve for
AcEx(lr_inf(~r)) and AcSt(lr_inf(~r)). In general, the Pareto curve for a given set Q ⊆ R^k is the set P of all minimal vectors ~v ∈ R^k such that there is no ~u ∈ Q with ~v < ~u. Note that P may contain vectors that are not in Q (for example, if Q = {x ∈ R | x < 1}, then P = {1}). However, every vector ~v ∈ P is "almost" in Q in the sense that for every ε > 0 there is ~u ∈ Q with ~v ≤ ~u + ~ε, where ~ε = (ε, ..., ε). This naturally leads to the notion of an ε-approximate Pareto curve, P_ε, which is a subset of Q such that for all vectors ~v ∈ P of the Pareto curve there is a vector ~u ∈ P_ε such that ~v ≤ ~u + ~ε. Note that P_ε is not unique.
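The notions of domination and of an ε-approximate Pareto curve translate directly into code. The sketch below (ours, not from the paper) extracts from a finite set of achievable vectors a small subset P_ε covering, up to ~ε, every maximal vector of the set; it merely illustrates the definition and is not the algorithm developed in Section 4.

```python
def dominated(v, u):
    """True iff u strictly dominates v, i.e., v <= u componentwise and v != u (that is, v < u)."""
    return all(x <= y for x, y in zip(v, u)) and any(x < y for x, y in zip(v, u))

def eps_approximate_pareto(points, eps):
    """Pick a subset P_eps of `points` so that every maximal point v of `points`
    has some u in P_eps with v <= u + (eps, ..., eps)."""
    maximal = [v for v in points if not any(dominated(v, u) for u in points)]
    chosen = []
    for v in maximal:
        if not any(all(x <= y + eps for x, y in zip(v, u)) for u in chosen):
            chosen.append(v)
    return chosen

pts = [(0.0, 2.0), (0.3, 1.7), (0.5, 1.5), (0.2, 1.0)]
print(eps_approximate_pareto(pts, 0.25))  # the dominated point (0.2, 1.0) never appears
```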
A running example (II). Consider again the MDP G of Figure 1 (left), and the strategy σ constructed in our running example (I). Let ~r = (r_1, r_2) be a pair of reward functions assigning the values 1 and 2 to three of the actions of G, as indicated in Figure 1, and zero to all remaining actions. For a run ω of the play G^σ_{s_0} that eventually alternates between two fixed locations, the vector lr(~r)(ω) is simply the average of the rewards along the period of ω, and averaging over all runs of G^σ_{s_0} yields the expectation E^σ_{s_0}[lr_inf(~r)]. Considering the satisfaction objective, the strategy σ also witnesses that a tuple (0.5, ~v) belongs to AcSt(lr_inf(~r)) for a suitable vector ~v, because P^σ_{s_0}[lr_inf(~r) ≥ ~v] = 0.5. The Pareto curve for AcEx(lr_inf(~r)) is a segment of points obtained by trading the reward collected in one MEC against the reward collected in the other, and the Pareto curve for AcSt(lr_inf(~r)) is the union of two such families of tuples (see Figure 1).

Now we are equipped with all the notions needed for understanding the main results of this paper. Our work is motivated by the six fundamental questions given in Section 1. In the next subsections we give detailed answers to these questions.

3.1. Expectation objectives.
The answers to Q.1–Q.6 for the expectation objectives are the following:
A.1 For all achievable solutions, 2-memory stochastic-update strategies are sufficient, i.e., for all ~v ∈ AcEx(lr_inf(~r)) there is a 2-memory stochastic-update strategy σ satisfying E^σ_s[lr_inf(~r)] ≥ ~v.
A.2 The Pareto curve P for AcEx(lr_inf(~r)) is a subset of AcEx(lr_inf(~r)), i.e., all optimal solutions are achievable.
A.3 There is a polynomial-time algorithm which, given any ~v ∈ Q^k, decides whether ~v ∈ AcEx(lr_inf(~r)).
A.4 If ~v ∈ AcEx(lr_inf(~r)), then there is a 2-memory stochastic-update strategy σ constructible in polynomial time satisfying E^σ_s[lr_inf(~r)] ≥ ~v.
A.5 There is a polynomial-time algorithm which, given ~v ∈ R^k, decides whether ~v belongs to the Pareto curve for AcEx(lr_inf(~r)).
A.6 There is a convex hull Z of finitely many vectors such that AcEx(lr_inf(~r)) is the downward closure of Z (i.e., AcEx(lr_inf(~r)) = {~v | ∃~u ∈ Z : ~v ≤ ~u}), and the Pareto curve for AcEx(lr_inf(~r)) is the union of all facets of Z whose vectors are not strictly dominated by vectors of Z. Further, an ε-approximate Pareto curve for AcEx(lr_inf(~r)) is computable in time polynomial in 1/ε, |G|, and max_{a∈A} max_{1≤i≤k} |~r_i(a)|, and exponential in k.

Figure 2. Example of insufficiency of memoryless strategies

Let us note that A.1 is tight in the sense that neither memoryless randomized nor pure strategies are sufficient for achievable solutions. This is witnessed by the MDP of Figure 2 with reward functions r_1, r_2 such that r_i(b_i) = 1 and r_i(b_j) = 0 for i ≠ j. Consider a strategy σ which initially selects between the actions b_1 and a randomly (each with probability 0.5) and then keeps selecting b_1 or b_2, whichever is available. Hence, E^σ_s[lr_inf((r_1, r_2))] = (0.5, 0.5). However, the vector (0.5, 0.5) is not achievable by a strategy σ′ which is memoryless or pure, because then we inevitably have that E^{σ′}_s[lr_inf((r_1, r_2))] is equal either to (0, 1) or to (1, 0). The same example shows that memoryless randomized or pure strategies are not sufficient even for ε-approximation. Considering e.g. ε = 0.1, a history-dependent randomized strategy is needed to achieve the value (0.5 − 0.1, 0.5 − 0.1) or better.

The 2-memory stochastic-update strategy from A.1 and A.4 operates in two modes. Starting in the first mode, it reaches the MECs of the MDP with appropriate probabilities; once a MEC is reached, the strategy stochastically switches to a second mode, never leaving the current MEC and ensuring certain "frequencies" of taking the actions of the MEC. Since both modes can be implemented by memoryless strategies, only two memory elements are needed to remember which mode is currently being executed. We also show that the 2-memory stochastic-update strategy constructed can be efficiently transformed into a finite-memory deterministic-update randomized strategy, and hence the answers A.1 and A.4 are also valid for finite-memory deterministic-update randomized strategies (see Section 4.1). Observe that A.2 can be seen as a generalization of the well-known result for single payoff functions which says that finite-state MDPs with mean-payoff objectives have optimal strategies (in this case, the Pareto curve consists of a single number known as the "value"). Also observe that A.2 does not hold for infinite-state MDPs (a counterexample is simple to construct even for a single reachability objective, see e.g. [5, Example 6]).

Finally, note that if σ is a finite-memory stochastic-update strategy, then G^σ_s is a finite-state Markov chain. Hence, for almost all runs ω in G^σ_s we have that lr(~r)(ω) exists and is equal to lr_inf(~r)(ω). This means that there is actually no difference between maximizing the expected value of lr_inf(~r) and maximizing the expected value of lr(~r) over all strategies for which lr(~r) exists.
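One can also see numerically why both memory and randomization are needed in the example of Figure 2. The simulation sketch below assumes the natural reading of the figure (a state s1 with a self-loop b1 and an action a leading to a state s2 with a self-loop b2); that reading, and all identifier names, are our assumptions rather than details taken from the paper.

```python
import random

# rewards: r_i(b_i) = 1 and r_i(b_j) = 0 for i != j; action a has reward 0 in both components
reward = [{"a": 0, "b1": 1, "b2": 0}, {"a": 0, "b1": 0, "b2": 1}]

def mean_payoff(actions):
    """Average reward vector over a finite prefix of a run."""
    n = len(actions)
    return tuple(sum(reward[i][a] for a in actions) / n for i in range(2))

def run_two_memory(steps):
    """Coin-flip strategy: first pick b1 or a with probability 0.5, then keep taking
    b1 or b2, whichever is available (this needs both randomization and memory)."""
    first = random.choice(["b1", "a"])
    rest = "b1" if first == "b1" else "b2"
    return [first] + [rest] * (steps - 1)

def run_memoryless(p, steps):
    """Memoryless randomized strategy: in s1 pick a with probability p, else b1."""
    actions, state = [], "s1"
    for _ in range(steps):
        if state == "s1" and random.random() < p:
            actions.append("a")
            state = "s2"
        else:
            actions.append("b1" if state == "s1" else "b2")
    return actions

samples = 2000
average = lambda vs: tuple(sum(v[i] for v in vs) / samples for i in range(2))
print(average([mean_payoff(run_two_memory(5000)) for _ in range(samples)]))       # close to (0.5, 0.5)
print(average([mean_payoff(run_memoryless(0.3, 5000)) for _ in range(samples)]))  # close to (0.0, 1.0)
```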
3.2. Satisfaction objectives. The answers to Q.1–Q.6 for the satisfaction objectives are presented below.
B.1 Achievable vectors require strategies with infinite memory in general. However, memoryless randomized strategies are sufficient for ε-approximate achievable vectors; in fact, a stronger claim holds: for every ε > 0 and every (ν, ~v) ∈ AcSt(lr_inf(~r)), there is a memoryless randomized strategy σ with P^σ_s[lr_inf(~r) ≥ ~v − ~ε] ≥ ν. Here ~ε = (ε, ..., ε).
B.2 The Pareto curve P for AcSt(lr_inf(~r)) is a subset of AcSt(lr_inf(~r)), i.e., all optimal solutions are achievable.
B.3 There is a polynomial-time algorithm which, given ν ∈ [0, 1] and ~v ∈ Q^k, decides whether (ν, ~v) ∈ AcSt(lr_inf(~r)).
B.4 If (ν, ~v) ∈ AcSt(lr_inf(~r)), then for every ε > 0 there is a memoryless randomized strategy σ constructible in polynomial time such that P^σ_s[lr_inf(~r) ≥ ~v − ~ε] ≥ ν − ε.
B.5 There is a polynomial-time algorithm which, given ν ∈ [0, 1] and ~v ∈ R^k, decides whether (ν, ~v) belongs to the Pareto curve for AcSt(lr_inf(~r)).
B.6 The Pareto curve P for AcSt(lr_inf(~r)) may be neither connected nor closed. However, P is a union of finitely many sets whose closures are convex polytopes, and, perhaps surprisingly, the set {ν | (ν, ~v) ∈ P} is always finite. The sets in the union that give P (resp. the inequalities that define them) can be computed. Further, an ε-approximate Pareto curve for AcSt(lr_inf(~r)) is computable in time polynomial in 1/ε, |G|, and max_{a∈A} max_{1≤i≤k} |~r_i(a)|, and exponential in k.
The algorithms of B.3 and B.4 are polynomial in the size of G and the size of the binary representations of ~v and ε.

The result B.1 is again tight. In Lemma 5.2 we show that memoryless pure strategies are insufficient for ε-approximate achievable vectors, i.e., there are ε > 0 and (ν, ~v) ∈ AcSt(lr_inf(~r)) such that for every memoryless pure strategy σ we have P^σ_s[lr_inf(~r) ≥ ~v − ~ε] < ν − ε.

As noted in B.1, a strategy σ achieving a given vector (ν, ~v) ∈ AcSt(lr_inf(~r)) may require infinite memory. Still, our proof of B.1 reveals a "recipe" for constructing such a σ by simulating the memoryless randomized strategies σ_ε which ε-approximate (ν, ~v) (intuitively, for smaller and smaller ε, the strategy σ simulates σ_ε longer and longer; the details are discussed in Section 5). Hence, for almost all runs ω in G^σ_s we again have that lr(~r)(ω) exists and is equal to lr_inf(~r)(ω).

4. Solution for Expectation Objectives

1_{s_0}(s) + ∑_{a∈A} y_a · δ(a)(s) = ∑_{a∈Act(s)} y_a + y_s   for all s ∈ S   (4.1)
∑_{s∈S_MEC} y_s = 1   (4.2)
∑_{s∈C} y_s = ∑_{a∈A∩C} x_a   for all MECs C of G   (4.3)
∑_{a∈A} x_a · δ(a)(s) = ∑_{a∈Act(s)} x_a   for all s ∈ S   (4.4)
∑_{a∈A} x_a · ~r_i(a) ≥ ~v_i   for all 1 ≤ i ≤ k   (4.5)

Figure 3. System L of linear inequalities for Theorem 4.1. (We define S_MEC ⊆ S to be the set of states contained in some MEC of G, and 1_{s_0}(s) = 1 if s = s_0 and 1_{s_0}(s) = 0 otherwise.)
The technical core of our results for expectation objectives is the following:
Theorem 4.1.
Let G = (S, A, Act, δ) be an MDP, s_0 ∈ S an initial state, ~r = (r_1, ..., r_k) a tuple of reward functions, and ~v ∈ R^k. The system of linear inequalities L from Figure 3 is constructible in polynomial time and satisfies:
• every nonnegative solution of L induces a 2-memory stochastic-update strategy σ satisfying E^σ_{s_0}[lr_inf(~r)] ≥ ~v;
• if ~v ∈ AcEx(lr_inf(~r)), then L has a nonnegative solution.

As we already noted in Section 1, the proof of Theorem 4.1 is non-trivial and it is based on novel techniques and observations. Our results about expectation objectives are corollaries to Theorem 4.1 and the arguments developed in its proof. For the rest of this section, we fix an MDP G, a vector of reward functions ~r = (r_1, ..., r_k), and an initial state s_0 (in the considered plays of G, the initial state is not written explicitly, unless it is different from s_0). Obviously, L is constructible in polynomial time. Let us briefly explain the intuition behind L. As mentioned earlier, a 2-memory stochastic-update strategy witnessing that ~v ∈ AcEx(lr_inf(~r)) works in two modes. In the first mode it ensures that each MEC is reached and never left with a certain probability, and in the second mode actions are taken with required frequencies. In L, the probability of reaching (and staying in) a MEC C is encoded as the value ∑_{s∈C} y_s, and Equations (4.1) are used to ensure that the numbers obtained are indeed realisable under some strategy. The meaning of these equations is similar to the meaning of analogous equations in [10]; essentially, they say that the expected number of times a state is entered (the left-hand side) equals the expected number of times the state is left plus the probability of switching to the second mode in it (the right-hand side). A more formal explanation of these equations is given at the end of the proof of Proposition 4.5. The frequency of taking an action a is then encoded as x_a, and realisability of the solution by some strategy is ensured using Equations (4.4). Here the meaning of the equations is that the frequency with which a state is entered must be equal to the frequency with which it is left; this is formalised in Lemma 4.3.

As both directions of Theorem 4.1 are technically involved, we prove them separately as Propositions 4.2 and 4.5.
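Since L is an ordinary linear system, checking ~v ∈ AcEx(lr_inf(~r)) reduces to a single LP feasibility call. The sketch below is our own illustration under an assumed dictionary encoding of the MDP; a MEC decomposition is taken as given (the `mecs` argument), and scipy is used purely for convenience.

```python
import numpy as np
from scipy.optimize import linprog

def build_L(S, A, Act, delta, rewards, v, mecs, s0):
    """Assemble Equations (4.1)-(4.5) of the system L for scipy.optimize.linprog.
    Variable order: y_a (a in A), y_s (s in S), x_a (a in A); all variables nonnegative."""
    ai = {a: i for i, a in enumerate(A)}
    si = {s: len(A) + i for i, s in enumerate(S)}
    xi = {a: len(A) + len(S) + i for i, a in enumerate(A)}
    n = 2 * len(A) + len(S)
    A_eq, b_eq = [], []
    for s in S:                               # (4.1), rearranged with the indicator on the right
        row = np.zeros(n)
        for a in A:
            row[ai[a]] += delta[a].get(s, 0.0)
        for a in Act[s]:
            row[ai[a]] -= 1.0
        row[si[s]] -= 1.0
        A_eq.append(row)
        b_eq.append(-1.0 if s == s0 else 0.0)
    row = np.zeros(n)                         # (4.2): y_s over states lying in some MEC sums to 1
    for T, _ in mecs:
        for s in T:
            row[si[s]] = 1.0
    A_eq.append(row)
    b_eq.append(1.0)
    for T, B in mecs:                         # (4.3): sum of y_s over C equals sum of x_a over C
        row = np.zeros(n)
        for s in T:
            row[si[s]] = 1.0
        for a in B:
            row[xi[a]] = -1.0
        A_eq.append(row)
        b_eq.append(0.0)
    for s in S:                               # (4.4): in-frequency of s equals out-frequency of s
        row = np.zeros(n)
        for a in A:
            row[xi[a]] += delta[a].get(s, 0.0)
        for a in Act[s]:
            row[xi[a]] -= 1.0
        A_eq.append(row)
        b_eq.append(0.0)
    A_ub, b_ub = [], []
    for r_i, v_i in zip(rewards, v):          # (4.5): sum_a x_a * r_i(a) >= v_i, negated for <=
        row = np.zeros(n)
        for a in A:
            row[xi[a]] = -r_i[a]
        A_ub.append(row)
        b_ub.append(-v_i)
    return n, xi, A_eq, b_eq, A_ub, b_ub

def achievable_expectation(S, A, Act, delta, rewards, v, mecs, s0):
    """v is in AcEx(lr_inf(r)) iff L has a nonnegative solution (Theorem 4.1)."""
    n, _, A_eq, b_eq, A_ub, b_ub = build_L(S, A, Act, delta, rewards, v, mecs, s0)
    res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success
```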
Proposition 4.2. Every nonnegative solution of the system L of Figure 3 induces a 2-memory stochastic-update strategy σ satisfying E^σ_{s_0}[lr_inf(~r)] ≥ ~v.

Proof of Proposition 4.2. First, let us consider Equations (4.4) of L. Intuitively, these equations are solved by an "invariant" distribution on actions, i.e., each solution gives frequencies of actions (up to a multiplicative constant) defined for all a ∈ A, s ∈ S, and σ ∈ Σ by

freq(σ, s, a) := lim_{T→∞} (1/T) ∑_{t=1}^{T} P^σ_s[A_t = a],

assuming that the defining limit exists (which might not be the case—cf. the proof of Proposition 4.5). We prove the following:

Lemma 4.3.
Assume that assigning (nonnegative) values x̄_a to the variables x_a solves Equations (4.4). Then there is a memoryless strategy ξ such that for every BSCC D of G^ξ, every s ∈ D ∩ S, and every a ∈ D ∩ A, the value freq(ξ, s, a) equals a common value freq(ξ, D, a) := x̄_a / ∑_{a′∈D∩A} x̄_{a′}.

Proof. For all s ∈ S we set x̄_s = ∑_{b∈Act(s)} x̄_b and define ξ by ξ(s)(a) := x̄_a / x̄_s if x̄_s > 0, and arbitrarily otherwise. We claim that the vector of values x̄_s forms an invariant measure of G^ξ. Indeed, noting that ∑_{a∈Act(s)} ξ(s)(a) · δ(a)(s′) is the probability of the transition s → s′ in G^ξ:

∑_{s∈S} x̄_s · ∑_{a∈Act(s)} ξ(s)(a) · δ(a)(s′) = ∑_{s∈S} ∑_{a∈Act(s)} x̄_s · (x̄_a / x̄_s) · δ(a)(s′)
 = ∑_{a∈A} x̄_a · δ(a)(s′)
 = ∑_{a∈Act(s′)} x̄_a   (by Equations (4.4))
 = x̄_{s′}.

As a consequence, x̄_s > 0 implies that s lies in some BSCC of G^ξ. Choose some BSCC D, and denote by x̄_D the number ∑_{a∈D∩A} x̄_a = ∑_{s∈D∩S} x̄_s. Also denote by I^a_t the indicator of A_t = a, given by I^a_t = 1 if A_t = a and 0 otherwise. By the Ergodic theorem for finite Markov chains (see, e.g., [15, Theorem 1.10.2]), for all s ∈ D ∩ S and a ∈ D ∩ A we have

E^ξ_s[ lim_{T→∞} (1/T) ∑_{t=1}^{T} I^a_t ] = ∑_{s′∈D∩S} (x̄_{s′} / x̄_D) · ξ(s′)(a) = (x̄_{Src(a)} / x̄_D) · (x̄_a / x̄_{Src(a)}) = x̄_a / x̄_D.

Because |I^a_t| ≤ 1, the Lebesgue dominated convergence theorem (see, e.g., [19, Chapter 4, Section 4]) yields E^ξ_s[ lim_{T→∞} (1/T) ∑_{t=1}^{T} I^a_t ] = lim_{T→∞} (1/T) ∑_{t=1}^{T} E^ξ_s[I^a_t], and thus

freq(ξ, s, a) = x̄_a / x̄_D = freq(ξ, D, a).

This finishes the proof of Lemma 4.3.

Assume that the system L is solved by assigning nonnegative values x̄_a to x_a and ȳ_χ to y_χ, where χ ∈ A ∪ S. W.l.o.g. we assume that ȳ_s = 0 for all states s not contained in any MEC. Let ξ be the strategy of Lemma 4.3. Using Equations (4.1), (4.2), and (4.3), we will define a 2-memory stochastic-update strategy σ as follows. The strategy σ has two memory elements, m_1 and m_2. A run of G^σ starts in s_0 with a given distribution on memory elements (see below). Then σ plays according to a suitable memoryless strategy (constructed below) until the memory changes to m_2, and then it starts behaving as ξ forever. Given a BSCC D of G^ξ, we denote by P^σ_{s_0}[switch to ξ in D] the probability that σ switches from m_1 to m_2 while in D. We construct σ so that

P^σ_{s_0}[switch to ξ in D] = ∑_{a∈D∩A} x̄_a.   (4.6)

Then for all a ∈ D ∩ A we have freq(σ, s_0, a) = P^σ_{s_0}[switch to ξ in D] · freq(ξ, D, a) = x̄_a. Finally, we obtain the following:

E^σ_{s_0}[lr_inf(r_i)] = ∑_{a∈A} ~r_i(a) · x̄_a.   (4.7)

The equation can be derived as follows:

E^σ_{s_0}[lr_inf(r_i)] = E^σ_{s_0}[ liminf_{T→∞} (1/T) ∑_{t=1}^{T} r_i(A_t) ]   (definition)
 = E^σ_{s_0}[ lim_{T→∞} (1/T) ∑_{t=1}^{T} r_i(A_t) ]   (see below)
 = lim_{T→∞} (1/T) ∑_{t=1}^{T} E^σ_{s_0}[r_i(A_t)]   (see below)
 = lim_{T→∞} (1/T) ∑_{t=1}^{T} ∑_{a∈A} r_i(a) · P^σ_{s_0}[A_t = a]   (definition of expectation)
 = ∑_{a∈A} r_i(a) · lim_{T→∞} (1/T) ∑_{t=1}^{T} P^σ_{s_0}[A_t = a]   (linearity of the limit)
 = ∑_{a∈A} r_i(a) · freq(σ, s_0, a)   (definition of freq(σ, s_0, a))
 = ∑_{a∈A} r_i(a) · x̄_a.   (freq(σ, s_0, a) = x̄_a)

The second equality follows from the fact that the limit is almost surely defined, which follows from the Ergodic theorem applied to the BSCCs of the finite Markov chain G^σ. The third equality holds by the Lebesgue dominated convergence theorem, because |r_i(A_t)| ≤ max_{a∈A} |r_i(a)|. Note that the right-hand side of Equation (4.7) is greater than or equal to ~v_i by Inequality (4.5) of L.

So, it remains to construct the strategy σ with the desired "switching" property expressed by Equations (4.6). Roughly speaking, we proceed in two steps.
1. We construct a finite-memory stochastic-update strategy σ̄ satisfying Equations (4.6). The strategy σ̄ is constructed so that it initially behaves as a certain finite-memory stochastic-update strategy, but eventually this mode is "switched" to the strategy ξ, which is followed forever.
2. The only problem with σ̄ is that it may use more than two memory elements in general. This is solved by applying the results of [10] and reducing the "initial part" of σ̄ (i.e., the part before the switch) to a memoryless strategy. Thus, we transform σ̄ into an "equivalent" strategy σ which is 2-memory stochastic-update.
Now we elaborate the two steps.
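The memoryless strategy ξ of Lemma 4.3 is just a normalization of the action values x̄_a; the following one-screen sketch (our illustration; the treatment of states with x̄_s = 0 is arbitrary by the lemma, here a uniform choice) makes that explicit.

```python
def frequency_strategy(S, Act, x_bar):
    """Memoryless strategy xi with xi(s)(a) = x_bar[a] / x_bar_s, as in Lemma 4.3."""
    xi = {}
    for s in S:
        x_s = sum(x_bar[a] for a in Act[s])
        if x_s > 0:
            xi[s] = {a: x_bar[a] / x_s for a in Act[s]}
        else:  # x_bar_s = 0: the lemma allows an arbitrary choice here
            xi[s] = {a: 1.0 / len(Act[s]) for a in Act[s]}
    return xi
```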
Step 1. For every MEC C of G, we denote by y_C the number ∑_{s∈C} ȳ_s = ∑_{a∈A∩C} x̄_a. By combining the solution of L with the results of Sections 3 and 5 of [10], one can construct a finite-memory stochastic-update strategy ζ which eventually stays in each MEC C with probability y_C. Formally, the construction is captured in the following lemma.

Lemma 4.4. Consider numbers ȳ_χ for all χ ∈ S ∪ A such that the assignment y_χ := ȳ_χ is a part of some nonnegative solution to L. Then there is a finite-memory stochastic-update strategy ζ which, starting from s_0, eventually stays in each MEC C with probability y_C := ∑_{s∈C} ȳ_s.
Proof.
In order to be able to use the results of [10, Section 3], we modify the MDP G and obtain a new MDP G′ as follows: For each state s we add a new absorbing state d_s. The only available action for d_s leads to a loop transition back to d_s with probability 1. We also add a new action, a_{d_s}, to every s ∈ S. The distribution associated with a_{d_s} assigns probability 1 to d_s.

Let us call K the set of constraints of the LP of Figure 3 in [10]. From the values ȳ_χ we now construct a solution to K: for every state s ∈ S and every action a ∈ Act(s) we set y_{(s,a)} := ȳ_a, and y_{(s,a_{d_s})} := ȳ_s. The values of the rest of the variables in K are determined by the second set of equations in K. The nonnegativity constraints in K are satisfied since the ȳ_χ are nonnegative. Finally, Equations (4.1) from L imply that the first set of equations in K is satisfied, because the ȳ_χ are part of a solution to L.

By Theorem 3.2 of [10] we thus have a memoryless strategy ρ for G′ which satisfies P^ρ_{s_0}[Reach(d_s)] ≥ ȳ_s for all s ∈ S. The strategy ζ then mimics the behavior of ρ until the moment when ρ chooses an action entering one of the new absorbing states. From that point on, ζ may choose some arbitrary fixed behavior to stay in the current MEC (note that if the current state s is not included in any MEC, then ȳ_s = 0, and so the strategy ρ would not choose to enter the new absorbing state). As a consequence, P^ζ_{s_0}[stay eventually in C] ≥ y_C, and in fact we get equality here, because of Equation (4.2) from L. Note that ζ only needs a finite constant amount of memory.

The strategy σ̄ works as follows. For a run initiated in s_0, the strategy σ̄ plays according to ζ until a BSCC of G^ζ is reached. This means that every possible continuation of the path stays in the current MEC C of G. Assume that C has states s_1, ..., s_k. We denote by x̄_s the sum ∑_{a∈Act(s)} x̄_a. At this point, the strategy σ̄ changes its behavior as follows: First, the strategy σ̄ strives to reach s_1 with probability one. Upon reaching s_1, it chooses (randomly, with probability x̄_{s_1}/y_C) either to behave as ξ forever, or to follow on to s_2. If the strategy σ̄ chooses to go on to s_2, it strives to reach s_2 with probability one. Upon reaching s_2, the strategy σ̄ chooses (randomly, with probability x̄_{s_2}/(y_C − x̄_{s_1})) either to behave as ξ forever, or to follow on to s_3, and so on, up to s_k. That is, the probability of switching to ξ in s_i is x̄_{s_i}/(y_C − ∑_{j=1}^{i−1} x̄_{s_j}).

Since ζ stays in a MEC C with probability y_C, the probability that the strategy σ̄ switches to ξ in s_i is equal to x̄_{s_i}. Consequently, for every BSCC D of G^ξ satisfying D ∩ C ≠ ∅ (and thus D ⊆ C) we have that the strategy σ̄ switches to ξ in a state of D with probability ∑_{s∈D∩S} x̄_s = ∑_{a∈D∩A} x̄_a. Hence, σ̄ satisfies Equations (4.6).

Step 2.
Now we show how to reduce the first phase of σ̄ (before the switch to ξ) to a memoryless strategy, using the results of [10, Section 3]. Unfortunately, these results are not applicable directly. We need to modify the MDP G into a new MDP G′, in the same way as we did above: For each state s we add a new absorbing state d_s. The only available action for d_s leads to a loop transition back to d_s with probability 1. We also add a new action, a_{d_s}, to every s ∈ S. The distribution associated with a_{d_s} assigns probability 1 to d_s.

Let us consider a finite-memory stochastic-update strategy σ′ for G′ defined as follows. The strategy σ′ behaves as σ̄ before the switch to ξ. Once σ̄ switches to ξ, say in a state s of G with probability p_s, the strategy σ′ chooses the action a_{d_s} with probability p_s. It follows that the probability of σ̄ switching in s is equal to the probability of reaching d_s in G′ under σ′. By [10, Theorem 3.2], there is a memoryless strategy σ″ for G′ that reaches d_s with probability p_s. We define σ in G to behave as σ″, with the exception that, in every state s, instead of choosing the action a_{d_s} with probability p_s it switches to behaving as ξ with probability p_s (which also means that the initial distribution on memory elements assigns p_{s_0} to m_2). Then, clearly, σ satisfies Equations (4.6) because

P^σ_{s_0}[switch in D] = ∑_{s∈D} P^{σ″}_{s_0}[fire a_{d_s}] = ∑_{s∈D} P^{σ′}_{s_0}[fire a_{d_s}] = P^{σ̄}_{s_0}[switch in D] = ∑_{a∈D∩A} x̄_a.

This concludes the proof of Proposition 4.2.
Proposition 4.5. If ~v ∈ AcEx(lr_inf(~r)), then L has a nonnegative solution.

Proof. Let ρ ∈ Σ be a strategy such that E^ρ_{s_0}[lr_inf(~r)] ≥ ~v. In general, the frequencies freq(ρ, s_0, a) of the actions may not be well defined, because the defining limits may not exist. A crucial trick to overcome this difficulty is to pick suitable "related" values f(a), lying between liminf_{T→∞} (1/T) ∑_{t=1}^{T} P^ρ_{s_0}[A_t = a] and limsup_{T→∞} (1/T) ∑_{t=1}^{T} P^ρ_{s_0}[A_t = a], which can be safely substituted for x_a in L. Since every bounded infinite sequence contains a convergent subsequence, there is an increasing sequence of indices T_1, T_2, ... such that the following limit exists for each action a ∈ A:

f(a) := lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[A_t = a].

Setting x_a := f(a) for all a ∈ A satisfies Inequalities (4.5) and Equations (4.4) of L. Indeed, the former follows from E^ρ_{s_0}[lr_inf(~r)] ≥ ~v and the following inequality, which holds for all 1 ≤ i ≤ k:

∑_{a∈A} ~r_i(a) · f(a) ≥ E^ρ_{s_0}[lr_inf(~r_i)].   (4.8)

The inequality follows from the following derivation:

∑_{a∈A} r_i(a) · f(a) = ∑_{a∈A} r_i(a) · lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[A_t = a]   (definition of f(a))
 = lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} ∑_{a∈A} r_i(a) · P^ρ_{s_0}[A_t = a]   (linearity of the limit)
 ≥ liminf_{T→∞} (1/T) ∑_{t=1}^{T} ∑_{a∈A} r_i(a) · P^ρ_{s_0}[A_t = a]   (definition of liminf)
 = liminf_{T→∞} (1/T) ∑_{t=1}^{T} E^ρ_{s_0}[r_i(A_t)]   (linearity of the expectation)
 ≥ E^ρ_{s_0}[lr_inf(r_i)].   (see below)

The last inequality is a consequence of Fatou's lemma (see, e.g., [19, Chapter 4, Section 3]); although the function r_i(A_t) may not be nonnegative, we can replace it with the nonnegative function r_i(A_t) − min_{a∈A} r_i(a) and add the subtracted constant afterwards.

To prove that Equations (4.4) are satisfied, it suffices to show that for all s ∈ S we have

∑_{a∈A} f(a) · δ(a)(s) = ∑_{a∈Act(s)} f(a).   (4.9)
This holds because

∑_{a∈A} f(a) · δ(a)(s) = ∑_{a∈A} lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[A_t = a] · δ(a)(s)   (definition of f)
 = lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} ∑_{a∈A} P^ρ_{s_0}[A_t = a] · δ(a)(s)   (linearity of the limit)
 = lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[S_{t+1} = s]   (definition of δ)
 = lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[S_t = s]   (see below)
 = lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} ∑_{a∈Act(s)} P^ρ_{s_0}[A_t = a]   (s must be followed by some a ∈ Act(s))
 = ∑_{a∈Act(s)} lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[A_t = a]   (linearity of the limit)
 = ∑_{a∈Act(s)} f(a).   (definition of f)

The fourth equality follows from the following:

lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[S_{t+1} = s] − lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[S_t = s]
 = lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} (P^ρ_{s_0}[S_{t+1} = s] − P^ρ_{s_0}[S_t = s])
 = lim_{ℓ→∞} (1/T_ℓ) (P^ρ_{s_0}[S_{T_ℓ+1} = s] − P^ρ_{s_0}[S_1 = s])
 = 0.

Now we have to set the values for y_χ, χ ∈ A ∪ S, and prove that they satisfy the rest of L when the values f(a) are assigned to x_a. Note that almost every run of G^ρ eventually stays in some MEC of G (cf., e.g., [9, Proposition 3.1]). For every MEC C of G, let y_C be the probability of all runs in G^ρ that eventually stay in C. Note that

∑_{a∈A∩C} f(a) = ∑_{a∈A∩C} lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[A_t = a]
 = lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} ∑_{a∈A∩C} P^ρ_{s_0}[A_t = a]
 = lim_{ℓ→∞} (1/T_ℓ) ∑_{t=1}^{T_ℓ} P^ρ_{s_0}[A_t ∈ C] = y_C.   (4.10)

Here the last equality follows from the fact that lim_{ℓ→∞} P^ρ_{s_0}[A_{T_ℓ} ∈ C] is equal to the probability of all runs in G^ρ that eventually stay in C (recall that almost every run eventually stays in a MEC of G) and the fact that the Cesàro sum of a convergent sequence is equal to the limit of the sequence.

To obtain y_a and y_s, we need to simplify the behavior of ρ before reaching a MEC, for which we use the results of [10]. As in the proof of Proposition 4.2, we first need to modify the MDP G into another MDP G′ as follows: For each state s we add a new absorbing state d_s. The only available action for d_s leads to a loop transition back to d_s with probability 1. We also add a new action, a_{d_s}, to every s ∈ S. The distribution associated with a_{d_s} assigns probability 1 to d_s. Using the results of [10], we prove the following lemma.

Lemma 4.6.
The existence of a strategy ρ satisfying E^ρ_{s_0}[lr_inf(~r)] ≥ ~v implies the existence of a (possibly randomized) memoryless strategy ζ for G′ such that

∑_{s∈C} P^ζ_{s_0}[Reach(d_s)] = y_C   for every MEC C of G.   (4.11)

Proof.
We give a proof by contradiction. Note that the proof structure is similar to the proof of direction 3 ⇒ 1 of [10, Theorem 3.2]. Let C_1, ..., C_n be all the MECs of G, and let X ⊆ R^n be the set of all vectors (x_1, ..., x_n) for which there is a strategy σ̄ in G′ such that P^{σ̄}_{s_0}[⋃_{s∈C_i} Reach(d_s)] ≥ x_i for all 1 ≤ i ≤ n. For a contradiction, suppose (y_{C_1}, ..., y_{C_n}) ∉ X. By [10, Theorem 3.2] the set X can be described as the set of solutions of a linear program, and hence it is convex. By the separating hyperplane theorem [2] there are weights w_1, ..., w_n such that ∑_{i=1}^{n} y_{C_i} · w_i > ∑_{i=1}^{n} x_i · w_i for every (x_1, ..., x_n) ∈ X. We define a reward function r by r(a) = w_i for an action a from C_i, where 1 ≤ i ≤ n, and r(a) = 0 for actions not in any MEC. Observe that the mean payoff of any run that eventually stays in a MEC C_i is w_i, and so the expected mean payoff w.r.t. r under ρ is ∑_{i=1}^{n} y_{C_i} · w_i. Because memoryless deterministic strategies suffice for maximising the (single-objective) expected mean payoff, there is also a memoryless deterministic strategy σ̂ for G that yields expected mean payoff w.r.t. r equal to z ≥ ∑_{i=1}^{n} y_{C_i} · w_i. We now define a strategy σ̄ for G′ to mimic σ̂ until a BSCC is reached, and when a BSCC is reached, say along a path w, the strategy σ̄ takes the action a_{d_{last(w)}}. Let x_i = P^{σ̄}_{s_0}[⋃_{s∈C_i} Reach(d_s)]. Due to the construction of σ̄, x_i is equal to the probability of runs that eventually stay in C_i under σ̂: this follows because once a BSCC is reached on a path w, every run ω extending w has an infinite suffix containing only states from the MEC containing the state last(w). Hence ∑_{i=1}^{n} x_i · w_i = z. However, by the choice of the weights w_i we get that (x_1, ..., x_n) ∉ X, and hence a contradiction, because σ̄ witnesses that (x_1, ..., x_n) ∈ X.

Hence, we have obtained that there is some (possibly memory-dependent) strategy with the required reachability probabilities, and using [10, Theorem 3.2] we get that there also is a memoryless strategy ζ with the required properties. This completes the proof of Lemma 4.6.

We now proceed with the proof of Proposition 4.5. Let U_a be the function over the runs in G′ returning the (possibly infinite) number of times the action a is used. We are now ready to define the assignment for the variables y_χ of L:

y_a := E^ζ_{s_0}[U_a]   for all a ∈ A,
y_s := E^ζ_{s_0}[U_{a_{d_s}}] = P^ζ_{s_0}[Reach(d_s)]   for all s ∈ S.

Note that [10, Lemma 3.3] ensures that all y_a and y_s are indeed well-defined finite values, and that they satisfy Equations (4.1) of L. Equations (4.3) of L are satisfied due to Equations (4.11) and (4.10). Equations (4.11), together with the fact that ∑_C ∑_{a∈A∩C} f(a) = ∑_C y_C = 1, imply Equations (4.2) of L. This completes the proof of Proposition 4.5.
The item A.1 in Section 3.1 follows directly from Theorem 4.1. Let us analyze A.2. Suppose ~v is a point of the Pareto curve. Consider the system L′ of linear inequalities obtained from L by replacing the constants ~v_i in Inequalities (4.5) with new variables z_i. Let Q ⊆ R^k be the projection of the set of solutions of L′ to z_1, ..., z_k. From Theorem 4.1 and the definition of the Pareto curve, the (Euclidean) distance of ~v to Q is 0. Because the set of solutions of L′ is a closed set, Q is also closed and thus ~v ∈ Q. This gives us a solution to L with the variables z_i having values ~v_i, and we can use Theorem 4.1 to get a strategy witnessing that ~v ∈ AcEx(lr_inf(~r)).

Now consider the items A.3 and A.4. The system L is linear, and hence the problem whether ~v ∈ AcEx(lr_inf(~r)) is decidable in polynomial time by employing polynomial-time algorithms for linear programming. A 2-memory stochastic-update strategy σ satisfying E^σ_{s_0}[lr_inf(~r)] ≥ ~v can be computed as follows (note that the proof of Proposition 4.2 is not fully constructive, so we cannot apply this proposition immediately). First, we find a solution of the system L, and we denote by x̄_a the value assigned to x_a. Let (T_1, B_1), ..., (T_n, B_n) be the end components such that a ∈ ⋃_{i=1}^{n} B_i iff x̄_a > 0, and T_1, ..., T_n are pairwise disjoint. We construct another system of linear inequalities consisting of Equations (4.1) of L and the equations ∑_{s∈T_i} y_s = ∑_{s∈T_i} ∑_{a∈Act(s)} x̄_a for all 1 ≤ i ≤ n. Due to [10], there is a solution to this system iff in the MDP G′ from the proof of Proposition 4.2 there is a strategy that, for every i, reaches d_s for s ∈ T_i with probability ∑_{s∈T_i} ∑_{a∈Act(s)} x̄_a. Such a strategy indeed exists (consider, e.g., the strategy σ′ from the proof of Proposition 4.2). Thus, there is a solution to the above system, and we can denote by ŷ_s and ŷ_a the values assigned to y_s and y_a. We define σ by

σ_n(s, m_1)(a) = ŷ_a / ∑_{a′∈Act(s)} ŷ_{a′},   σ_n(s, m_2)(a) = x̄_a / ∑_{a′∈Act(s)} x̄_{a′},

and further σ_u(a, s, m_1)(m_2) = ŷ_s, σ_u(a, s, m_2)(m_2) = 1, and the initial memory distribution assigns (1 − ŷ_{s_0}) and ŷ_{s_0} to m_1 and m_2, respectively. Due to [10] we have P^σ_{s_0}[change memory to m_2 in s] = ŷ_s, and the rest follows similarly as in the proof of Proposition 4.2.

The item A.5 can be proved as follows. To test whether ~v ∈ AcEx(lr_inf(~r)) lies on the Pareto curve, we turn the system L into a linear program LP by adding the objective to maximize ∑_{1≤i≤k} ∑_{a∈A} x_a · ~r_i(a). Then we check that there is no better solution than ∑_{1≤i≤k} ~v_i.

Finally, the item A.6 is obtained by considering the system L′ above and computing all (exponentially many) vertices of the polytope of all solutions. Then we compute the projections of these vertices onto the dimensions z_1, ..., z_k and retrieve all the maximal vertices. Moreover, if for every ~v ∈ {ℓ · ε | ℓ ∈ Z ∧ −M_r ≤ ℓ · ε ≤ M_r}^k, where M_r = max_{a∈A} max_{1≤i≤k} |~r_i(a)|, we decide whether ~v ∈ AcEx(lr_inf(~r)), we can easily construct an ε-approximate Pareto curve.
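The Pareto-membership test of A.5 is a one-line variation of the LP sketch given after Theorem 4.1: keep the constraints of L, add the summed-reward objective, and compare the optimum with ∑_i ~v_i. The function below is again only an illustration; `build_L` is the hypothetical helper from that earlier sketch, and scipy's default solver is assumed.

```python
import numpy as np
from scipy.optimize import linprog

def on_pareto_curve(S, A, Act, delta, rewards, v, mecs, s0, tol=1e-9):
    """v lies on the Pareto curve iff v is achievable and no feasible point of L improves
    the total reward sum_i sum_a x_a * r_i(a) beyond sum_i v_i (test described for A.5)."""
    n, xi, A_eq, b_eq, A_ub, b_ub = build_L(S, A, Act, delta, rewards, v, mecs, s0)
    c = np.zeros(n)
    for r_i in rewards:
        for a, j in xi.items():
            c[j] -= r_i[a]        # linprog minimizes, so negate to maximize the summed reward
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success and -res.fun <= sum(v) + tol
```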
4.1. Deterministic-update Strategies for Expectation Objectives.

We now show that for expectation objectives, finite-memory deterministic-update strategies suffice. This is captured in the following proposition.
Proposition 4.7.
Every nonnegative solution of the system L induces a finite-memory deterministic-update strategy σ satisfying E^σ_s[lr_inf(~r)] ≥ ~v.

Proof. The proof proceeds almost identically to the proof of Proposition 4.2. Let us recall the important steps from that proof first. There we worked with the numbers x̄_a, a ∈ A, which, assigned to the variables x_a, formed a part of the solution to L. We also worked with two important strategies. The first one, a finite-memory deterministic-update strategy ζ, made sure that, starting in s, a run stays in a MEC C forever with probability y_C = ∑_{a∈A∩C} x̄_a. The second one, a memoryless strategy σ′, had the property that when the starting distribution was α(s) := x̄_s = ∑_{a∈Act(s)} x̄_a, then E^σ′_α[lr_inf(~r)] ≥ ~v. To produce the promised finite-memory deterministic-update strategy σ we now have to combine the strategies ζ and σ′ using only deterministic memory updates.

We now define the strategy σ. It works in three phases. First, it reaches every MEC C and stays in it with the probability y_C. Second, it prepares the distribution α, and finally, third, it switches to σ′. It is clear how the strategy is defined in the third phase. As for the first phase, this is also identical to what we did in the proof of Proposition 4.2 for σ̄: The strategy σ follows the strategy ζ from the beginning until a bottom strongly connected component (BSCC) of the associated finite-state Markov chain G_ζ is reached. At that point the run has already entered its final MEC C, in which it stays forever, which happens with probability y_C. The last thing to solve is thus the second phase. Two cases may occur. Either there is a state s ∈ C such that |Act(s) ∩ C| > 1, i.e., there are at least two actions the strategy can take from s without leaving C.
Let us denote these actions a and b. Consider an enumeration C = {s_1, ..., s_k} of the states of C. Now we define the second phase of σ when in C. We start with defining the memory used in the second phase. We symbolically represent the possible contents of the memory as {Wait_1, ..., Wait_k, Switch_1, ..., Switch_k}. The second phase then starts with the memory set to Wait_1. Generally, if the memory is set to Wait_i then σ aims at reaching s_i with probability 1. This is possible (since s_i is in the same MEC), and it is a well-known fact that it can be done without using memory. On visiting s_i, the strategy chooses the action a with probability x̄_{s_i} / (y_C − ∑_{j=1}^{i−1} x̄_{s_j}) and the action b with the remaining probability. In the next step the deterministic update function sets the memory either to Switch_i or Wait_{i+1}, depending on whether the last action seen is a or b, respectively. (Observe that if i = k then the probability of taking b is 0.) The memory set to Switch_i means that the strategy aims at reaching s_i almost surely, and upon doing so, the strategy switches to the third phase, following σ′. It is easy to observe that, conditioned on staying in C, the probability of switching to the third phase in some s_i ∈ C is x̄_{s_i}/y_C; thus the unconditioned probability of doing so is x̄_{s_i}, as desired.

The remaining case to solve is when |Act(s) ∩ C| = 1 for all s ∈ C. But then switching to the third phase is solved trivially with the right probabilities, because staying in C inevitably already means mimicking σ′.
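The two-action trick of the second phase can be checked by a short calculation: choosing action a at s_i with probability x̄_{s_i}/(y_C − ∑_{j<i} x̄_{s_j}) and otherwise moving on to s_{i+1} makes the conditional probability of switching at s_i exactly x̄_{s_i}/y_C. The following sketch verifies this numerically; the numbers are made up for illustration and are not taken from the paper.

# Sketch: the sequential "Wait_i / Switch_i" scheme from the proof.
# At stage i the strategy switches with probability x[i] / (y - sum(x[:i]))
# and otherwise proceeds to stage i+1.  The resulting switching distribution
# should be x[i] / y for every i.
def switch_distribution(x, y):
    probs, remaining = [], 1.0          # `remaining` = P(not switched yet)
    for i, xi in enumerate(x):
        p = xi / (y - sum(x[:i]))       # conditional switching probability
        probs.append(remaining * p)     # probability of ending in Switch_i
        remaining *= (1.0 - p)
    return probs

x = [0.1, 0.25, 0.15]                   # made-up values of x_bar for s_1, s_2, s_3
y = sum(x)                              # y_C = total probability of staying in C
print(switch_distribution(x, y))        # -> [0.2, 0.5, 0.3] == [xi / y for xi in x]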
5. Solution for Satisfaction Objectives

In this section we prove the items B.1–B.6 of Section 3.2. Let us fix an MDP G, a vector of rewards ~r = (r_1, ..., r_k), and an initial state s. We start by assuming that the MDP G is strongly connected (i.e., (S, A) is an end component).
Proposition 5.1.
Assume that G is strongly connected and that there is a strategy π such that P^π_s[lr_inf(~r) ≥ ~v] > 0. Then the following is true. (Here we extend the notation in a straightforward way from a single initial state to a general initial distribution α.)

(1) There is a strategy ξ satisfying P^ξ_s[lr_inf(~r) ≥ ~v] = 1 for all s ∈ S.
(2) For each ε > 0 there is a memoryless randomized strategy ξ_ε that for all s ∈ S satisfies P^{ξ_ε}_s[lr_inf(~r) ≥ ~v − ~ε] = 1.

Moreover, the problem whether there is some π such that P^π_s[lr_inf(~r) ≥ ~v] > 0 is decidable in polynomial time. Strategies ξ_ε are computable in time polynomial in the size of G and the size of the binary representation of ~r and ε.

Proof. By [6, 13] we get that P^π_s[lr_inf(~r) ≥ ~v] > 0 implies the existence of a strategy ξ such that P^ξ_s[lr_inf(~r) ≥ ~v] = 1: Since lr_inf(~r) ≥ ~v is a tail (or prefix-independent) function, it follows from the results of [6] that
if P^π_s[lr_inf(~r) ≥ ~v] > 0, then there exists a state s′ in the MDP with value 1, i.e., there exists s′ such that sup_π P^π_{s′}[lr_inf(~r) ≥ ~v] = 1. It follows from the results of [13] that in MDPs with tail functions optimal strategies exist, and thus there exists a strategy π_1 from s′ such that P^{π_1}_{s′}[lr_inf(~r) ≥ ~v] = 1. Since the MDP is strongly connected, the state s′ can be reached with probability 1 from s by a strategy π_2. Hence the strategy π_2, followed by the strategy π_1 after reaching s′, is the witness strategy π′ such that P^{π′}_s[lr_inf(~r) ≥ ~v] = 1.

This gives us item 1 of Proposition 5.1 and also immediately implies ~v ∈ AcEx(lr_inf(~r)). It follows that there are nonnegative values x̄_a for all a ∈ A such that assigning x̄_a to x_a solves Equations (4.4) and (4.5) of the system L (see Figure 3). Let us assume, w.l.o.g., that ∑_{a∈A} x̄_a = 1.

Lemma 4.3 gives us a memoryless randomized strategy ζ such that for all BSCCs D of G_ζ, all s ∈ D ∩ S and all a ∈ D ∩ A we have that freq(ζ, s, a) = x̄_a / ∑_{a′∈D∩A} x̄_{a′}. We denote by freq(ζ, D, a) the value x̄_a / ∑_{a′∈D∩A} x̄_{a′}.

Now we are ready to prove item 2 of Proposition 5.1. Let us fix ε > 0.
We obtain ξ_ε by a suitable perturbation of the strategy ζ, in such a way that all actions get positive probabilities and the frequencies of actions change only slightly. There exists an arbitrarily small (strictly) positive solution x′_a of Equations (4.4) of the system L (it suffices to consider a strategy τ which always takes the uniform distribution over the actions in every state and then assign freq(τ, s, a)/N to x_a for sufficiently large N). As the system of Equations (4.4) is linear and homogeneous, assigning x̄_a + x′_a to x_a also solves this system, and Lemma 4.3 gives us a strategy ξ_ε satisfying freq(ξ_ε, s, a) = (x̄_a + x′_a)/X, where X = ∑_{a′∈A} (x̄_{a′} + x′_{a′}) = 1 + ∑_{a′∈A} x′_{a′}. We may safely assume that ∑_{a′∈A} x′_{a′} ≤ ε / (2·M_r), where M_r = max_{a∈A} max_{1≤i≤k} |~r_i(a)|. Thus, we obtain

  ∑_{a∈A} freq(ξ_ε, s, a) · ~r_i(a) ≥ ~v_i − ε    (5.1)

by the following sequence of (in)equalities:

  ∑_{a∈A} freq(ξ_ε, s, a) · ~r_i(a)
    = ∑_{a∈A} ((x̄_a + x′_a)/X) · ~r_i(a)    (def)
    = (1/X) · ∑_{a∈A} x̄_a · ~r_i(a) + (1/X) · ∑_{a∈A} x′_a · ~r_i(a)    (rearranging)
    = ( ∑_{a∈A} x̄_a · ~r_i(a) + ((1−X)/X) · ∑_{a∈A} x̄_a · ~r_i(a) ) + (1/X) · ∑_{a∈A} x′_a · ~r_i(a)    (rearranging)
    ≥ ∑_{a∈A} x̄_a · ~r_i(a) − |((1−X)/X) · ∑_{a∈A} x̄_a · ~r_i(a)| − |(1/X) · ∑_{a∈A} x′_a · ~r_i(a)|    (property of abs. value)
    ≥ ∑_{a∈A} x̄_a · ~r_i(a) − ( |(1−X) · ∑_{a∈A} x̄_a · ~r_i(a)| + |∑_{a∈A} x′_a · ~r_i(a)| )    (from X > 1)
    ≥ ∑_{a∈A} x̄_a · ~r_i(a) − ( (X−1) · ∑_{a∈A} x̄_a · |~r_i(a)| + ∑_{a∈A} x′_a · |~r_i(a)| )    (prop. of |·| and X > 1)
    ≥ ∑_{a∈A} x̄_a · ~r_i(a) − ( (X−1) · M_r + (∑_{a∈A} x′_a) · M_r )    (property of M_r)
    ≥ ∑_{a∈A} x̄_a · ~r_i(a) − ( (∑_{a∈A} x′_a) · M_r + (∑_{a∈A} x′_a) · M_r )    (property of X and rearranging)
    = ∑_{a∈A} x̄_a · ~r_i(a) − 2 · (∑_{a∈A} x′_a) · M_r    (rearranging)
    ≥ ~v_i − 2 · (∑_{a∈A} x′_a) · M_r    (property of ~v)
    ≥ ~v_i − ε    (property of ε)

As G_{ξ_ε} is strongly connected, almost all runs ω of G_{ξ_ε} initiated in s satisfy

  lr_inf(~r)(ω) = ∑_{a∈A} freq(ξ_ε, s, a) · ~r(a) ≥ ~v − ~ε.

This finishes the proof of item 2.

Concerning the complexity of computing ξ_ε, note that the binary representation of every coefficient in L has only polynomial length. As the x̄_a are obtained as a solution of (a part of) L, standard results from linear programming imply that each x̄_a has a binary representation computable in polynomial time. The numbers x′_a are also obtained by solving a part of L restricted by ∑_{a′∈A} x′_{a′} ≤ ε / (2·M_r), which allows us to compute a binary representation of each x′_a in polynomial time. The strategy ξ_ε, defined in the proof of Proposition 5.1, assigns to each action only small arithmetic expressions over the x̄_a and x′_a. Hence, ξ_ε is computable in polynomial time.
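The effect of the perturbation can be illustrated numerically. The sketch below uses made-up frequencies and rewards (not taken from the paper): it perturbs a frequency vector x̄ by a small positive x′ with ∑_a x′_a ≤ ε/(2·M_r), renormalises, and checks that the resulting mean payoff drops by at most ε, mirroring inequality (5.1).

# Sketch: verify the bound behind inequality (5.1) on made-up data.
# x_bar  : optimal action frequencies (sum to 1, mean payoff >= v_i)
# x_prime: small strictly positive perturbation making every action used
def perturbed_payoff(x_bar, x_prime, rewards):
    X = 1.0 + sum(x_prime)                             # normalisation constant
    freq = [(xb + xp) / X for xb, xp in zip(x_bar, x_prime)]
    return sum(f * r for f, r in zip(freq, rewards))

x_bar   = [0.7, 0.3, 0.0]          # made-up solution of Equations (4.4)-(4.5)
rewards = [1.0, -1.0, 0.5]         # one reward dimension r_i, so M_r = 1
eps     = 0.1
M_r     = max(abs(r) for r in rewards)
x_prime = [eps / (2 * M_r) / 3] * 3                    # sum(x_prime) <= eps/(2*M_r)

original  = sum(xb * r for xb, r in zip(x_bar, rewards))
perturbed = perturbed_payoff(x_bar, x_prime, rewards)
assert perturbed >= original - eps                     # the claim of (5.1)
print(original, perturbed)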
To prove that the problem whether there is some ξ such that P^ξ_s[lr_inf(~r) ≥ ~v] > 0 is decidable in polynomial time, it suffices to show that if ~v ∈ AcEx(lr_inf(~r)), then (1, ~v) ∈ AcSt(lr_inf(~r)). This gives us a polynomial-time algorithm by applying Theorem 4.1. So let ~v ∈ AcEx(lr_inf(~r)). We show that there is a strategy ξ such that P^ξ_s[lr_inf(~r) ≥ ~v] = 1.

Since ~v ∈ AcEx(lr_inf(~r)), there are nonnegative rational values x̄_a for all a ∈ A such that assigning x̄_a to x_a solves Equations (4.4) and (4.5) of the system L. Assume, without loss of generality, that ∑_{a∈A} x̄_a = 1.

Given a ∈ A, let I_a : A → {0, 1} be a function given by I_a(a) = 1 and I_a(b) = 0 for all b ≠ a. For every i ∈ N, we denote by ξ_i a memoryless randomized strategy satisfying P^{ξ_i}_s[lr_inf(I_a) ≥ x̄_a − 2^{−i−1}] = 1. Note that for every i ∈ N there is κ_i ∈ N such that for all a ∈ A and s ∈ S we get

  P^{ξ_i}_s[ inf_{T ≥ κ_i} (1/T) ∑_{t=0}^{T} I_a(A_t) ≥ x̄_a − 2^{−i} ] ≥ 1 − 2^{−i}.

Now let us consider a sequence n_1, n_2, ... of numbers where n_i ≥ κ_i and n_i is so large that (N_{i−1} + κ_{i+1}) / (N_{i−1} + n_i + κ_{i+1}) ≤ 2^{−i}, where N_i := ∑_{j≤i} n_j (and N_0 := 0). The strategy ξ is constructed so that it behaves as ξ_i during the n_i steps with indices N_{i−1}, ..., N_i − 1. Consider a run ω = s_0 a_0 s_1 a_1 ... in which every block i achieves the frequency bound above (the general case is handled at the end of the proof), fix an action a ∈ A, and consider T with N_i ≤ T < N_{i+1}. We need the following inequality

  (1/T) ∑_{t=0}^{T} I_a(a_t) ≥ (x̄_a − 2^{−i})(1 − 2^{−i})    (5.2)

which can be proved as follows. First, note that

  (1/T) ∑_{t=0}^{T} I_a(a_t) ≥ (1/T) ∑_{t=N_{i−1}}^{N_i − 1} I_a(a_t) + (1/T) ∑_{t=N_i}^{T} I_a(a_t)

and that

  (1/T) ∑_{t=N_{i−1}}^{N_i − 1} I_a(a_t) = ( (1/n_i) ∑_{t=N_{i−1}}^{N_i − 1} I_a(a_t) ) · (n_i/T) ≥ (x̄_a − 2^{−i}) · n_i/T,

which gives

  (1/T) ∑_{t=0}^{T} I_a(a_t) ≥ (x̄_a − 2^{−i}) · n_i/T + (1/T) ∑_{t=N_i}^{T} I_a(a_t).    (5.3)

Now, we distinguish two cases. First, if T − N_i ≤ κ_{i+1}, then

  n_i/T ≥ n_i / (N_{i−1} + n_i + κ_{i+1}) = 1 − (N_{i−1} + κ_{i+1}) / (N_{i−1} + n_i + κ_{i+1}) ≥ 1 − 2^{−i}

and thus, by Equation (5.3),

  (1/T) ∑_{t=0}^{T} I_a(a_t) ≥ (x̄_a − 2^{−i})(1 − 2^{−i}).

Second, if T − N_i ≥ κ_{i+1}, then

  (1/T) ∑_{t=N_i+1}^{T} I_a(a_t) = ( (1/(T−N_i)) ∑_{t=N_i+1}^{T} I_a(a_t) ) · (T−N_i)/T ≥ (x̄_a − 2^{−i−1}) · (1 − (N_{i−1} + n_i)/T) ≥ (x̄_a − 2^{−i−1}) · (1 − 2^{−i} − n_i/T)

and thus, by Equation (5.3),

  (1/T) ∑_{t=0}^{T} I_a(a_t) ≥ (x̄_a − 2^{−i}) · n_i/T + (x̄_a − 2^{−i−1}) · (1 − 2^{−i} − n_i/T)
    ≥ (x̄_a − 2^{−i}) · ( n_i/T + 1 − 2^{−i} − n_i/T )
    ≥ (x̄_a − 2^{−i})(1 − 2^{−i}),

which finishes the proof of Equation (5.2).

Since the right-hand side of Equation (5.2) converges to x̄_a as i (and thus also T) goes to ∞, we obtain

  lim inf_{T→∞} (1/T) ∑_{t=0}^{T} I_a(a_t) ≥ x̄_a.

Finally, by the choice of κ_i, each block i fails to achieve the frequency bound above with probability at most 2^{−i}; since ∑_i 2^{−i} < ∞, almost every run has only finitely many failing blocks, and the estimates above apply to all sufficiently large T, so the displayed lim inf bound holds for almost every run. Because ∑_{a∈A} x̄_a = 1, the frequency of every action a then converges to x̄_a, and hence lr_inf(~r)(ω) = ∑_{a∈A} x̄_a · ~r(a) ≥ ~v for almost every run ω. This yields P^ξ_s[lr_inf(~r) ≥ ~v] = 1 and completes the proof.
Figure 4. MDP showing the need of infinite memory (states s_1, s_2; actions a_1, a_2, b_1, b_2).

The strategy ξ from the proof of Proposition 5.1 required infinite memory. We show that this may indeed be necessary, i.e., it can be the case that (ν, ~v) ∈ AcSt(lr_inf(~r)) although there is no finite-memory strategy σ satisfying P^σ_s[lr_inf(~r) ≥ ~v] ≥ ν (and in fact no finite-memory strategy satisfying P^σ_s[lr_inf(~r) ≥ ~v] > 0). Consider the MDP G of Figure 4 and the rewards ~r = (r_1, r_2), where r_i (for i ∈ {1, 2}) returns 1 for b_i and 0 for all other actions. Let s_1 be the initial vertex. It is easy to see that (0.5, 0.5) ∈ AcEx(lr_inf(~r)): consider for example a strategy that first chooses both available actions in s_1 with uniform probabilities, and in subsequent steps chooses the self-loops on s_1 or s_2 deterministically. From the results presented above we subsequently get that (1, (0.5, 0.5)) ∈ AcSt(lr_inf(~r)).

On the other hand, let σ be an arbitrary finite-memory strategy. The Markov chain it induces is by definition finite, and for each of its BSCCs C one of the following takes place:

• C contains both s_1 and s_2. Then by the Ergodic theorem, for almost every run ω we have lr_inf(I_{a_1})(ω) + lr_inf(I_{a_2})(ω) > 0, which means that lr_inf(I_{b_1})(ω) + lr_inf(I_{b_2})(ω) < 1, and thus necessarily lr_inf(~r)(ω) ≱ (0.5, 0.5).
• C contains only the state s_1 (resp. s_2), in which case all runs that enter it satisfy lr_inf(~r)(ω) = (1, 0) (resp. lr_inf(~r)(ω) = (0, 1)).

In all cases, almost no run satisfies lr_inf(~r) ≥ (0.5, 0.5), and hence P^σ_s[lr_inf(~r) ≥ (0.5, 0.5)] = 0. Moreover, ε-optimal strategies are not necessarily memoryless pure, as the following lemma shows.

Lemma 5.2.
There is an MDP G, a vector of reward functions ~r = (r_1, r_2), a number ε > 0 and a vector (ν, ~v) ∈ AcSt(lr_inf(~r)) such that there is no memoryless pure strategy σ satisfying P^σ_s[lr_inf(~r) ≥ ~v − ~ε] > ν − ε.

Proof. We can reuse the MDP G and the rewards ~r showing the need of infinite memory for optimal strategies. We let ν = 1 and ~v = (0.5, 0.5); as shown above, (ν, ~v) ∈ AcSt(lr_inf(~r)). Taking, e.g., any ε with 0 < ε < 0.5, every memoryless pure strategy σ satisfies P^σ_s[lr_inf(~r) ≥ ~v − ~ε] = 0, and hence there is no memoryless pure strategy σ with P^σ_s[lr_inf(~r) ≥ ~v − ~ε] > ν − ε.

We are now ready to prove the items B.1, B.3 and B.4. Let C_1, ..., C_ℓ be all MECs of G. We say that a MEC C_i is good for ~v if there is a state s of C_i and a strategy π satisfying P^π_s[lr_inf(~r) ≥ ~v] > 0 that never leaves C_i when starting in s. Using Proposition 5.1, we can decide in polynomial time whether a given MEC is good for a given ~v. Let C be the union of all MECs good for ~v. Then, by Proposition 5.1, there is a strategy ξ such that for all s ∈ C we have P^ξ_s[lr_inf(~r) ≥ ~v] = 1, and for each ε > 0 there is a memoryless randomized strategy ξ_ε, computable in polynomial time, such that for all s ∈ C we have P^{ξ_ε}_s[lr_inf(~r) ≥ ~v − ~ε] = 1.

Consider a strategy τ, computable in polynomial time, which maximizes the probability of reaching C. Denote by σ a strategy which behaves as τ before reaching C and as ξ afterwards. Similarly, denote by σ_ε a strategy which behaves as τ before reaching C and as ξ_ε afterwards. Note that σ_ε is computable in polynomial time. Clearly, (ν, ~v) ∈ AcSt(lr_inf(~r)) iff P^τ_s[Reach(C)] ≥ ν, because σ achieves ~v with probability P^τ_s[Reach(C)]. Thus, we obtain that ν ≤ P^τ_s[Reach(C)] ≤ P^{σ_ε}_s[lr_inf(~r) ≥ ~v − ~ε]. Finally, in order to decide whether (ν, ~v) ∈ AcSt(lr_inf(~r)), it suffices to decide whether P^τ_s[Reach(C)] ≥ ν, which can be done in polynomial time; a sketch of the whole procedure follows.
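The decision procedure just described for items B.1 and B.3 (find the MECs good for ~v, then compare the maximal probability of reaching their union with ν) can be summarised in a few lines. The helper functions below are assumed to exist and are only placeholders, not part of the paper: mecs stands for maximal end component decomposition, is_good for the test of Proposition 5.1, and max_reach_prob for single-objective reachability.

# Sketch of the AcSt membership test, with assumed helpers:
#   mecs(mdp)                 -> list of maximal end components (sets of states)
#   is_good(mdp, mec, v)      -> does some strategy staying inside `mec` satisfy
#                                P[lr_inf(r) >= v] > 0 ?   (Proposition 5.1)
#   max_reach_prob(mdp, s, T) -> maximal probability of reaching the set T from s
def in_AcSt(mdp, s, v, nu):
    good = [C for C in mecs(mdp) if is_good(mdp, C, v)]
    target = set().union(*good) if good else set()
    return max_reach_prob(mdp, s, target) >= nu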
Now we prove item B.2. Suppose (ν, ~v) is a vector of the Pareto curve. We let C be the union of all MECs good for ~v. Recall that the Pareto curve constructed for expectation objectives is achievable (item A.2). Due to the correspondence between AcSt and AcEx in strongly connected MDPs we obtain the following. There is λ > 0 such that for every MEC D not contained in C, every s ∈ D, and every strategy σ that does not leave D, it is possible to have P^σ_s[lr_inf(~r) ≥ ~u] > 0 only for vectors ~u for which there is an i such that ~v_i − ~u_i ≥ λ, i.e., only when ~v is greater than ~u by λ in some component. Thus, for every ε < λ and every strategy σ such that P^σ_s[lr_inf(~r) ≥ ~v − ~ε] ≥ ν − ε, it must be the case that P^σ_s[Reach(C)] ≥ ν − ε. Because for single-objective reachability optimal strategies exist, we get that there is a strategy τ satisfying P^τ_s[Reach(C)] ≥ ν, and by using methods similar to the ones of the previous paragraphs we obtain (ν, ~v) ∈ AcSt(lr_inf(~r)).

The polynomial-time algorithm mentioned in item B.5 works as follows. First check whether (ν, ~v) ∈ AcSt(lr_inf(~r)) and if not, return "no". Otherwise, find all MECs good for ~v and compute the maximal probability of reaching them from the initial state. If the probability is strictly greater than ν, return "no". Otherwise, continue by performing the following procedure for every 1 ≤ i ≤ k, where k is the dimension of ~v: Find all MECs C for which there is ε > 0 such that C is good for ~u, where ~u is obtained from ~v by increasing the i-th component by ε (this can be done in polynomial time using linear programming). Compute the maximal probability of reaching these MECs. If for any i the probability is at least ν, return "no"; otherwise return "yes".
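Under the same assumed helpers, the B.5 test can be sketched as follows; it reuses in_AcSt, mecs, is_good and max_reach_prob from the previous sketch, and exists_good_eps stands for the linear-programming test mentioned above (a hypothetical name, not from the paper).

# Sketch of the Pareto-membership test for satisfaction objectives (item B.5).
# Assumed helper in addition to those of the previous sketch:
#   exists_good_eps(mdp, C, v, i) -> is there eps > 0 such that C is good for
#                                    v with its i-th component increased by eps?
def on_pareto_curve(mdp, s, v, nu, k):
    if not in_AcSt(mdp, s, v, nu):
        return False
    good = [C for C in mecs(mdp) if is_good(mdp, C, v)]
    target = set().union(*good) if good else set()
    if max_reach_prob(mdp, s, target) > nu:
        return False                       # nu could be increased: not Pareto optimal
    for i in range(k):
        better = [C for C in mecs(mdp) if exists_good_eps(mdp, C, v, i)]
        tgt = set().union(*better) if better else set()
        if max_reach_prob(mdp, s, tgt) >= nu:
            return False                   # the i-th component of v could be increased
    return True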
The first claim of B.6 follows from Running example (II). We prove that the set N := {ν | (ν, ~v) ∈ P}, where P is the Pareto curve for AcSt(lr_inf(~r)), is indeed finite. As we already showed, for every fixed ~v there is a union C of MECs good for ~v, and (ν, ~v) ∈ AcSt(lr_inf(~r)) iff C can be reached with probability at least ν. Hence |N| ≤ 2^{|G|}, because the latter is an upper bound on the number of unions of MECs in G.

To prove the other claims, let N be the set {ν | (ν, ~v) ∈ P} where P is the Pareto curve for AcSt(lr_inf(~r)). Let us consider a fixed ν ∈ N. This gives us a collection R(ν) of all unions 𝒞 of MECs which can be reached with probability at least ν. For a MEC C let Sol(C) be the set AcEx(lr_inf(~r)) of the MDP obtained by restricting G to C. Further, for every union 𝒞 ∈ R(ν) we set Sol(𝒞) := ⋂_{C∈𝒞} Sol(C). Finally, Sol(R(ν)) := ⋃_{𝒞∈R(ν)} Sol(𝒞). From the analysis above we already know that
Sol(R(ν)) = {~v | (ν, ~v) ∈ AcSt(lr_inf(~r))}. As a consequence, (ν, ~v) ∈ P iff ν ∈ N, ~v is maximal in Sol(R(ν)), and ~v ∉ Sol(R(ν′)) for any ν′ ∈ N with ν′ > ν. In other words, P is also the Pareto curve of the set Q := {(ν, ~v) | ν ∈ N, ~v ∈ Sol(R(ν))}. Observe that Q is a finite union of downward closures of bounded convex polytopes, because every Sol(C) is a bounded convex polytope. Finally, observe that N can be computed using the algorithms for optimizing single-objective reachability, and the inequalities defining Sol(C) can also be computed using our results on AcEx. By a generalised convex polytope we denote a set of points described by a finite conjunction of linear inequalities, which may be both strict and non-strict.
Claim 5.3.
Let X be a generalised convex polytope. The smallest convex polytope containing X is its closure, cl(X). Moreover, the set cl(X) \ X is a union of some of the facets of cl(X).

Proof.
Let I be the set of inequalities defining X, and denote by I′ the modification of this set where all the inequalities are turned into non-strict ones. The closure cl(X) is indeed a convex polytope, as it is described by I′. Since every convex polytope is closed, if it contains X then it must also contain its closure. Thus cl(X) is the smallest convex polytope containing X. Let α < β be a strict inequality from I. By I′(α = β) we denote the set I′ ∪ {α = β}. The points of cl(X) \ X form a union of convex polytopes, each one given by the set I′(α = β) for some strict inequality α < β from I. Thus, cl(X) \ X is a union of facets of cl(X).

The following lemma now finishes the proof of B.6:

Lemma 5.4.
Let Q be a finite union of bounded convex polytopes Q_1, ..., Q_m. Then its Pareto curve P is a finite union of bounded generalised convex polytopes P_1, ..., P_n. Moreover, if the inequalities describing the Q_i are given, then the inequalities describing the P_i can be computed.

Proof. We proceed by induction on the number m of components of Q. If m = 0 then P = ∅ is clearly a bounded convex polytope, easily described by arbitrary two incompatible inequalities. For m ≥ 1, let Q′ := ⋃_{i=1}^{m−1} Q_i. By the induction hypothesis, the Pareto curve of Q′ is some P′ := ⋃_{i=1}^{n′} P_i where every P_i, 1 ≤ i ≤ n′, is a bounded generalised convex polytope, described by some set of linear inequalities. Denote by dom(X) the (downward closed) set of all points dominated by some point of X. Observe that P, the Pareto curve of Q, is the union of all points which either are maximal in Q_m and do not belong to dom(P′) (observe that dom(P′) = dom(Q′)), or are in P′ and do not belong to dom(Q_m). In symbols:

  P = (maximal points from Q_m \ dom(P′)) ∪ (P′ \ dom(Q_m)).

The set dom(P′) of all ~x for which there is some ~y ∈ P′ such that ~y ≥ ~x is a union of projections of generalised convex polytopes – just add the inequalities from the definition of each P_i instantiated with ~y to the inequality ~y ≥ ~x, and remove ~y by projecting. Thus, dom(P′) is a union of generalised convex polytopes itself. A difference of two generalised convex polytopes is a union of generalised convex polytopes. Thus the set of maximal points from Q_m \ dom(P′) is a union of bounded generalised convex polytopes, and for the same reasons so is P′ \ dom(Q_m).

Finally, let us show how to compute P. This amounts to computing the projection and the set difference. For convex polytopes, efficient computation of projections is a problem studied since the 19th century. One possible approach, non-optimal from the complexity point of view but easy to explain, is to traverse the vertices of the convex polytope, project them individually, and then take the convex hull of the projected vertices. To compute a projection of a generalised convex polytope X, we first take its closure cl(X) and project the closure. Then we traverse all the facets of the projection and mark every facet to which at least one point of X is projected. This can be verified by testing whether the inequalities defining the facet in conjunction with the inequalities defining X have a solution. Finally, we remove from the projection all facets which are not marked. Due to Claim 5.3, the difference of the projection of cl(X) and the projection of X is a union of facets, and every facet from the difference has the property that no point from X is projected to it. Thus we have obtained the projection of X.

Computing the set difference of two bounded generalised convex polytopes is easier: Consider two generalised convex polytopes, given by sets I_1 and I_2 of inequalities. Then subtracting the second from the first yields the union of the generalised convex polytopes given by the inequalities I_1 ∪ {α ⊀ β}, where α ≺ β ranges over all inequalities (strict or non-strict) in I_2.
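The set-difference step at the end of the proof is purely syntactic, so it is easy to sketch. In the representation below a constraint is a triple (coefficients, bound, strict) encoding α·x < β (strict) or α·x ≤ β (non-strict); the representation and the function names are illustrative only, not taken from the paper.

# Sketch: difference of two generalised convex polytopes, each given as a
# list of constraints (coeffs, bound, strict) encoding  coeffs.x < bound
# (strict=True) or coeffs.x <= bound (strict=False).  Subtracting the second
# polytope yields one generalised polytope per constraint of I2, obtained by
# adding the negation of that constraint to I1.
def negate(constraint):
    coeffs, bound, strict = constraint
    # not (a.x <  b)  is  a.x >= b,  i.e. (-a).x <= -b   (non-strict)
    # not (a.x <= b)  is  a.x >  b,  i.e. (-a).x <  -b   (strict)
    return ([-c for c in coeffs], -bound, not strict)

def difference(I1, I2):
    return [I1 + [negate(c)] for c in I2]   # union of these polytopes = I1 \ I2

# Example in the plane: subtract {x <= 1, y <= 1} from {x <= 2, y <= 2}.
I1 = [([1, 0], 2.0, False), ([0, 1], 2.0, False)]
I2 = [([1, 0], 1.0, False), ([0, 1], 1.0, False)]
print(difference(I1, I2))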
6. A Note on Equivalence of Definitions of Strategies

In this section we argue that the definitions of strategies as functions (SA)*S → dist(A) and as triples (σ_u, σ_n, α) are interchangeable.

Note that formally a strategy π : (SA)*S → dist(A) gives rise to a Markov chain G_π with states (SA)*S and transitions w → was with probability π(w)(a) · δ(a)(s), for all w ∈ (SA)*S, a ∈ A and s ∈ S. Given σ = (σ_u, σ_n, α) and a run w = (s_0, m_0, a_0)(s_1, m_1, a_1) ... of G_σ, denote w[i] = s_0 a_0 s_1 a_1 ... s_{i−1} a_{i−1} s_i. We define f(w) to be the run s_0 a_0 s_1 a_1 s_2 a_2 ..., i.e., the limit of the prefixes w[i].

We need to show that for every strategy σ = (σ_u, σ_n, α) there is a strategy π : (SA)*S → dist(A) (and vice versa) such that for every measurable set of runs W of G_π we have P^σ_s[f^{-1}(W)] = P^π_s[W]. We only present the construction of the strategies and the basic arguments; the technical part of the proof is straightforward.

Given π : (SA)*S → dist(A), one can easily define a deterministic-update strategy σ = (σ_u, σ_n, α) which uses memory (SA)*S. The initial memory element is the initial state s, the next move function is defined by σ_n(s, w) = π(w), and the memory update function σ_u is defined by σ_u(a, s, w) = was. The reader can observe that there is a naturally defined bijection between runs in G_π and in G_σ, and that this bijection preserves probabilities of sets of runs.

In the opposite direction, given σ = (σ_u, σ_n, α), we define π : (SA)*S → dist(A) as follows. Given w = s_0 a_0 ... s_{n−1} a_{n−1} s_n ∈ (SA)*S and a ∈ A, we denote by U_{wa} the set of all paths in G_σ that have the form

  (s_0, m_0, a_0)(s_1, m_1, a_1) ... (s_{n−1}, m_{n−1}, a_{n−1})(s_n, m_n, a)

for some m_0, ..., m_n. We put π(w)(a) = P^σ_s[U_{wa}] / ∑_{a′∈A} P^σ_s[U_{wa′}]. The key observation for the proof of correctness of this construction is that the probability of U_{wa} in G_σ is equal to the probability of taking the path w and then the action a in G_π.
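The first direction of the argument (from a history-dependent strategy π to a deterministic-update triple) amounts to using the history itself as the memory. A minimal sketch of this wrapping is shown below; the class and function names are illustrative, not part of the paper.

# Sketch: wrap a history-based strategy pi : (SA)*S -> dist(A) as a
# deterministic-update strategy (sigma_u, sigma_n, initial memory), using the
# full history as the memory element.
class DeterministicUpdateStrategy:
    def __init__(self, pi, s0):
        self.pi = pi
        self.memory = (s0,)                      # initial memory: the history "s0"

    def next_move(self, s):                      # sigma_n(s, m) = pi(m)
        return self.pi(self.memory)

    def update(self, a, s):                      # sigma_u(a, s, m) = m a s
        self.memory = self.memory + (a, s)

# Usage on made-up names: play a uniform history-based strategy for two steps.
pi = lambda history: {"left": 0.5, "right": 0.5}
strat = DeterministicUpdateStrategy(pi, "s0")
print(strat.next_move("s0")); strat.update("left", "s1"); print(strat.next_move("s1"))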
7. Conclusions

In this paper we have studied the problem of determining whether for a given MDP there exists a strategy achieving a certain value in each of multiple given limit-average objective functions. We have concentrated on two different interpretations of the functions, namely the expectation objectives and the satisfaction objectives, and provided algorithms solving the problem.

The next step in this line of research is to implement and evaluate the algorithms. On the theoretical side, one could further study the problem of existence of a strategy that simultaneously satisfies several expectation objectives and satisfaction objectives, or even combine the limit-average functions with different kinds of functions, such as ω-regular objectives or cumulative reward objectives.
Acknowledgements.
The authors thank David Parker and Dominik Wojtczak for initial discussions on the topic. T. Brázdil is supported by the Czech Science Foundation, grant No P202/12/P612. K. Chatterjee is supported by the Austrian Science Fund (FWF) Grant No P 23499-N23; FWF NFN Grant No S11407-N23 (RiSE); ERC Start grant (279307: Graph Games); Microsoft faculty fellows award. V. Forejt is supported by a Royal Society Newton Fellowship and EPSRC project EP/J012564/1.
References

[1] E. Altman. Constrained Markov Decision Processes (Stochastic Modeling). Chapman & Hall/CRC, 1999.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004.
[3] T. Brázdil, V. Brožek, K. Chatterjee, V. Forejt, and A. Kučera. Two views on multiple mean-payoff objectives in Markov decision processes. In LICS, pages 33–42. IEEE Computer Society, 2011.
[4] T. Brázdil, V. Brožek, and K. Etessami. One-counter stochastic games. In K. Lodaya and M. Mahajan, editors, FSTTCS, volume 8 of LIPIcs, pages 108–119. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2010.
[5] T. Brázdil, V. Brožek, V. Forejt, and A. Kučera. Reachability in recursive Markov decision processes. Inf. Comput., 206(5):520–537, 2008.
[6] K. Chatterjee. Concurrent games with tail objectives. Theor. Comput. Sci., 388:181–198, December 2007.
[7] K. Chatterjee. Markov decision processes with multiple long-run average objectives. In V. Arvind and S. Prasad, editors, FSTTCS, volume 4855 of Lecture Notes in Computer Science, pages 473–484. Springer, 2007.
[8] K. Chatterjee, R. Majumdar, and T. A. Henzinger. Markov decision processes with multiple objectives. In B. Durand and W. Thomas, editors, STACS, volume 3884 of Lecture Notes in Computer Science, pages 325–336. Springer, 2006.
[9] C. Courcoubetis and M. Yannakakis. Markov decision processes and regular events. Automatic Control, IEEE Transactions on, 43(10):1399–1418, Oct. 1998.
[10] K. Etessami, M. Kwiatkowska, M. Vardi, and M. Yannakakis. Multi-objective model checking of Markov decision processes. LMCS, 4(4):1–21, 2008.
[11] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, 1997.
[12] V. Forejt, M. Z. Kwiatkowska, G. Norman, D. Parker, and H. Qu. Quantitative multi-objective verification for probabilistic systems. In P. A. Abdulla and K. R. M. Leino, editors, TACAS, volume 6605 of Lecture Notes in Computer Science, pages 112–127. Springer, 2011.
[13] H. Gimbert and F. Horn. Solving simple stochastic tail games. In M. Charikar, editor, SODA, pages 847–862. SIAM, 2010.
[14] J. Koski. Multicriteria truss optimization. In W. Stadler, editor, Multicriteria Optimization in Engineering and in the Sciences. Plenum Press, 1988.
[15] J. R. Norris. Markov chains. Cambridge University Press, 1998.
[16] G. Owen. Game Theory. Academic Press, 1995.
[17] C. H. Papadimitriou and M. Yannakakis. On the approximability of trade-offs and optimal access of web sources. In FOCS, pages 86–92. IEEE Computer Society, 2000.
[18] M. Puterman. Markov Decision Processes. John Wiley and Sons, 1994.
[19] H. Royden. Real analysis. Prentice Hall, 3rd edition, 12 Feb. 1988.
[20] R. Szymanek, F. Catthoor, and K. Kuchcinski. Time-energy design space exploration for multi-layer memory architectures. In DATE, pages 318–323. IEEE Computer Society, 2004.
[21] P. Yang and F. Catthoor. Pareto-optimization-based run-time task scheduling for embedded systems. In R. Gupta, Y. Nakamura, A. Orailoglu, and P. H. Chou, editors, CODES+ISSS, pages 120–125. ACM, 2003.
This work is licensed under the Creative Commons Attribution-NoDerivs License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/2.0/